Statistics and Model Construction


Introduction 

Constructing  and  evaluating  behavioral  science  models  is  a 
complex  process.  Decisions  must  be  made  about  which  variables  to 
include,  which  variables  are  related  to  each  other,  the  functional 
forms  of  the  relationships,  and  so  on.  The  last  10  years  have  seen  a 
substantial  extension  of  the  range  of  statistical  tools  available  for 
use in the construction process. The progress in tool development has
been accompanied by the publication of handbooks that introduce the
methods in general terms (Arminger, Clogg, & Sobel, 1995; Tinsley &
Brown, 2000a). Each chapter in these handbooks cites a wide range of
books and articles on specific analysis topics.

Recent  developments  are  too  broad  to  cover  in  detail  in  a  single 
chapter.  Instead,  this  chapter  examines  current  statistical  modeling 
practices  from  a  particular  perspective:  How  can  available  statistical 
tools  be  translated  into  practices  that  improve  the  quality  of 
behavioral  science  models? 


Framing  the  Problem 

For  the  purposes  of  this  chapter,  a  high  quality  model  is  defined 
as  a  model  that  is  grounded  in  reliable  knowledge.  Reliable  knowledge 
is  produced  when  a  scientific  community  reaches  a  consensus 
interpretation  (Ziman,  1978),  developed  through  a  process  of 
principled argument (Abelson, 1995). The acronym "MAGIC" summarizes
critical elements of this process. M, magnitude, is "... the
quantitative support for the qualitative claim" made by a theory. A,
articulation, is "... the degree of comprehensible detail in which
conclusions are phrased." G, generality, is "... the breadth of
applicability of the conclusions." I, interestingness, is hard to
define, but Abelson (1995) suggests "... to be theoretically interesting
it must have the potential, through empirical analysis, to change what
people believe about an important issue" (italics in the original).
C, credibility, is the believability of a research claim. Credibility
depends on methodological soundness and the theoretical coherence of
the claim.

Statistical  methods  contribute  to  reliable  knowledge  when  they 
promote  principled  argument.  This  chapter  examines  trends  in  data 
analysis  methods  from  that  perspective.  First,  the  chapter  will 
consider  construction  of  measurement  and  substantive  models.  These 
models  are  considered  separately  because  different  methods  (e.g., 
factor analysis vs. regression) traditionally have been used for the
two purposes. The second section of the chapter covers methods of
model appraisal and amendment. This section begins with a brief review
of debates related to significance testing and then considers methods
of augmenting this practice by using effect sizes (ESs), confidence
intervals (CIs), and goodness-of-fit indices (GFIs). Issues related to
defining and choosing between alternative models (e.g., confirmation
bias) are also discussed. Finally, the chapter examines qualitative
analysis and exploratory data analysis as methods that can be applied
to extend existing models.


Overview  of  Model  Construction  Methods 

The  following  examination  of  available  model  construction  methods 
is organized around three themes. The first theme is a distinction
between measurement models and path models. This element of the
discussion reflects the fact that most data analysis problems can be
assigned to one of two broad categories. Some problems involve
construct measurement. Other problems involve determining the
relationships between different constructs. Methods in the first
category produce measurement models. Methods in the second category
produce substantive models. Substantive models describe causal paths
linking different constructs. For this reason, these models are
sometimes referred to as path models. Table 1 illustrates these
categories with reference to specific methods covered in this
discussion.


The intended endpoint of most behavioral research is a causal
explanation. A progression from measurement models to explanatory
models is natural in this context. For example, early research on Gulf
War illness and posttraumatic stress disorder included substantial
efforts to demonstrate that these syndromes existed. Initial efforts
had to establish models to represent each syndrome. This problem had
to be solved before moving on to the development of path models.
Path models were then constructed to identify causes and effects of
these illnesses. This development pattern is a normal sequence that
justifies the consideration of measurement and path models as separate
but related topics.

The second underlying theme of this discussion focuses on a
fundamental dichotomy, characterizing variables as "differences in
degree" or "differences in kind." Constructs that involve differences
in degree are referred to as continuous measures; scales that measure
differences in kind are referred to as discrete or categorical
measures. Different analysis models are appropriate for each class of
variable.

Table 1. Model Construction Methods

Measurement Models

    Dimensional Models
        Exploratory
            Exploratory factor analysis (EFA)
            Multidimensional scaling (MDS)
        Confirmatory
            Confirmatory factor analysis (CFA)

    Categorical Models
        Exploratory
            Exploratory cluster analysis (ECA)
            Latent class analysis (LCA)
        Confirmatory
            Expectation-Maximization Mixture Analysis
            Taxometrics

Path Models

    Dimensional Models
        Exploratory
            Regression, including multiple regression
            Analysis of variance (ANOVA)
            Hierarchical Linear Models (HLM)*
        Confirmatory
            Structural equation modeling*

    Categorical Models
        Exploratory
            Categorical and limited dependent variables (CLDV)**
        Confirmatory
            Taxometrics
            Latent class analysis (LCA)

*  Includes latent growth curve analysis (LGCA) as a special case.
** Includes logistic regression, logit analysis, probit analysis,
   loglinear analysis, and other specific models as special cases.


Finally, the discussion of analytic methods will draw a
distinction between exploratory and confirmatory analyses. This theme
focuses on the constraints imposed on a model. Pure exploratory
analysis imposes the minimum number of constraints required to perform
an analysis. Pure confirmatory analysis completely constrains the
model by specifying the theoretical constructs involved, linking each
observed variable explicitly to a construct, and specifying the values
of the parameters involved. Factor
loadings and regression coefficients are familiar examples of the
parameters specified within confirmatory models. The pure forms of
exploratory and confirmatory analysis can be regarded as endpoints of
a continuum. The typical analysis imposes weak constraints. For
example, the number of factors in a factor analysis may be fixed
without constraining which items define what factors, or how large the
factor loadings will be. The exploratory/confirmatory dichotomy is a
reminder of the endpoints on the continuum. Movement from the
exploratory extreme toward the confirmatory extreme defines progress
within a research domain.


Model  Construction 

Model construction has theory validation as its ultimate
objective. This view provides a framework for covering the full range
of issues associated with model construction. Theories involve claims
about causal patterns. These claims are not required when the goal is
simply to predict an outcome. Prediction only requires a statistical
association between a criterion and one or more predictors;
associations do not necessarily have to indicate causal relationships.
A predictive model can be evaluated simply by determining whether it
is accurate when it is applied to new samples from the original
population (i.e., cross-validation) or to samples from different
populations (i.e., generalizability).
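
The cross-validation idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses ordinary least squares, a single random split into calibration and holdout halves, and hypothetical function names (fit_ols, cross_validate) that do not come from the sources cited in this chapter.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply previously estimated coefficients to new predictor values."""
    design = np.column_stack([np.ones(len(X)), X])
    return design @ coef


def r_squared(y, y_hat):
    """Proportion of criterion variance reproduced by the predictions."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def cross_validate(X, y, train_fraction=0.5, seed=0):
    """Fit on one random subsample, then score on the held-out cases.

    Returns (calibration R^2, holdout R^2). A large drop from the first
    to the second value suggests the equation capitalized on chance.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(len(y) * train_fraction)
    train, test = order[:n_train], order[n_train:]
    coef = fit_ols(X[train], y[train])
    return (
        r_squared(y[train], predict(coef, X[train])),
        r_squared(y[test], predict(coef, X[test])),
    )


if __name__ == "__main__":
    # Simulated (hypothetical) data: three predictors, one criterion.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(400, 3))
    y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=400)
    print(cross_validate(X, y))
```

Testing generalizability follows the same pattern, except that the holdout cases would be drawn from a different population rather than split off from the original sample.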

Theory validation imposes additional criteria. The associations
among variables must be both consistent with the theory being tested
and inconsistent with competing theories. Also, causal inferences
must be justified.

Measurement models are required for testing theories. From this
perspective, the measurement problem is to determine whether a
coherent measure is formed by the observed patterns of association
among hypothesized indicators of a construct. It is necessary to
demonstrate that this basic assumption is justified before testing any
hypothesis about the relationship between the measured construct and
other constructs. For example, the antecedents and consequences of
depression cannot be studied until depression itself can be measured.

Given a measurement scale, researchers can begin to investigate
the pattern of associations between the measured construct and
measures of other relevant constructs. The empirical pattern of
associations determines scale validity (American Psychological
Association [APA], 1985) even as it tests theoretical predictions
concerning the pattern of associations among constructs. In this case,
the predicted pattern of associations defines a substantive model.
When two or more theories address the same topic, several substantive
models have to be compared with the observed pattern of associations.
This comparison provides the basis for identifying the most reasonable
model in light of the available empirical evidence.
The full process of model construction requires the development
of both measurement and substantive models. The methods used to
construct measurement and substantive models are somewhat different.
Some methods, such as multiple regression and factor analysis, are
familiar to most researchers. Other methods, such as taxometrics and
categorical and limited dependent variable (CLDV) analyses, are less
familiar. It is critical that researchers understand the range of
options that are available to them. This understanding will permit
the investigator to mold the analysis to his or her research concerns.
As a general principle, research issues should drive the researcher's
choice of analysis methods, rather than vice versa. (For additional
discussion of problem-driven vs. methods-driven research, see Ness &
Tepe, this volume.)


Measurement  Models 


Differences  in  Degree 

Most  behavioral  constructs  are  conceived  of  as  differences  in 
degree,  and  their  measurement  scales  are  typically  developed  using 
exploratory  factor  analysis  (EFA)  and  confirmatory  factor  analysis 
(CFA). Gorsuch's (1983) readable introduction provides guidance on
all  of  the  important  elements  of  EFA.  Fabrigar,  Wegener,  MacCallum, 
and  Strahan  (1999)  also  provide  a  summary  of  current  EFA  practices  and 
recommendations  for  improving  those  practices. 

A  trend  toward  the  use  of  CFA  instead  of  EFA  is  one  sign  of 
movement  toward  stronger  methods  in  behavioral  research.  The 
increasingly  frequent  use  of  CFA  has  been  accompanied  by  publication 
of  a  number  of  texts  that  provide  general  introductions  to  the  method. 
However,  introductory  texts  often  omit  important  topics  (Steiger, 
2001).  Bollen's  (1989)  text  remains  a  recommended  choice  for  an 
introduction  to  these  methods.  Boomsma  (2000)  and  McDonald  and  Ho 
(2002)  provide  similar  recommendations  with  respect  to  CFA. 

The  decision  to  employ  EFA  or  CFA  is  influenced  by  the  prior 
research  that  is  available  to  guide  model  development.  As  suggested 
by  its  name,  EFA  is  the  most  appropriate  analysis  option  when  little 
is  known  about  the  structure  of  a  domain.  For  example,  a 
questionnaire  constructed  to  measure  leadership  might  consist  of  a 
large  number  of  questions.  Investigators  could  reasonably  ask  whether 
various  types  of  behavior  described  by  the  questionnaire  items 
effectively  combine  to  define  a  single  overall  leadership  style. 
Alternative  models  would  divide  the  items  into  subsets  that  define  two 
or  more  distinct  leadership  styles.  EFA  can  be  used  to  compare  the 
unidimensional and multidimensional model alternatives.

Exploratory Factor Analysis (EFA) in Practice. Fabrigar et al.
(1999) have summarized current EFA practices from the perspective of
three basic decision points in a factor analysis. First, a method of
factoring must be chosen. Second, the number of factors to extract
must be determined. Third, a criterion for rotating the factors to a
final solution must be chosen. The analyst has a number of options at
each decision point. The most common choices are principal components
analysis (PCA) to define factors, Kaiser's (1960) criterion to select
the number of factors, and orthogonal varimax rotation (Fabrigar et
al., 1999). Computer programs probably contribute significantly to
this pattern. In many statistical packages, this analysis results
when an investigator simply provides a set of variables for analysis
and accepts the default values for each program option. Researchers
therefore may "choose" PCA without realizing they have made a choice.

The decision regarding how many factors to extract is important
for theory formulation and testing. This decision defines the number
of theoretical constructs to be represented in the measurement model.
Obviously, this decision is critically important in modeling. Thus,
it is noteworthy that Kaiser's (1960) criterion (i.e., λ > 1.00) often
extracts too many factors. The magnitude of this problem increases
when more variables are analyzed and/or when the sample size is small
(Buja & Eyuboglu, 1992; Cota, Longman, Holden, Fekken, & Xinaris,
1993). Parallel analysis (Horn, 1965; Humphreys & Montanelli, 1975)
and Velicer's (1976) procedure provide better guidance. These
procedures have not been widely used in the past because they have
been difficult to implement. Recently published computer routines can
solve this problem (Kaufman & Dunlap, 2000; O'Connor, 2000).
Guidelines based on Monte Carlo studies can also be used to reduce the
risk of extracting too many factors (Buja & Eyuboglu, 1992; Cota et
al., 1993; Lautenschlager, 1989). The newer methods promote parsimony
by reducing the risk of constructing models that include phantom
theoretical constructs that reflect chance patterns of covariation
within a particular database.
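
The contrast between Kaiser's criterion and parallel analysis can be illustrated with a short sketch. This is not one of the published routines cited above. It assumes the principal-components form of parallel analysis (eigenvalues of the full correlation matrix), normally distributed random comparison data, and a 95th-percentile threshold; the function names and defaults are hypothetical.

```python
import numpy as np


def sorted_eigenvalues(data):
    """Eigenvalues of the correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]


def kaiser_count(data):
    """Number of eigenvalues greater than 1.00 (Kaiser's criterion)."""
    return int(np.sum(sorted_eigenvalues(data) > 1.0))


def parallel_analysis_count(data, n_sims=200, percentile=95, seed=0):
    """Number of factors whose eigenvalues exceed those of random data.

    Random normal data sets with the same number of cases and variables
    are generated, and the observed eigenvalues are compared, in order,
    with the chosen percentile of the random eigenvalues. Counting stops
    at the first observed eigenvalue that fails the comparison.
    """
    rng = np.random.default_rng(seed)
    n_cases, n_vars = data.shape
    observed = sorted_eigenvalues(data)
    random_eigs = np.empty((n_sims, n_vars))
    for i in range(n_sims):
        random_eigs[i] = sorted_eigenvalues(rng.normal(size=(n_cases, n_vars)))
    threshold = np.percentile(random_eigs, percentile, axis=0)
    count = 0
    for obs, cutoff in zip(observed, threshold):
        if obs <= cutoff:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Pure noise: 20 unrelated variables measured on 100 cases.
    noise = np.random.default_rng(3).normal(size=(100, 20))
    print("Kaiser:", kaiser_count(noise))
    print("Parallel analysis:", parallel_analysis_count(noise))
```

With purely random data, roughly half of the sample eigenvalues exceed 1.00 by chance, so Kaiser's criterion retains several "factors" where parallel analysis typically retains none.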

Overextraction may not be a major problem for EFA.
Overextraction has little effect on the structure of true factors
(Wood, Tataryn, & Gorsuch, 1996). The greater problem lies in wasted
effort spent theorizing about the meaning of what are essentially
chance findings. Because subjectively plausible interpretations can
be offered even for factors derived from random data (Armstrong &
Soelberg, 1968), interpretability is not a good guide for factor
retention. Decisions based on sample size and number of indicators
also are of limited value in avoiding overextraction (MacCallum,
Widaman, Zhang, & Hong, 1999). Factor structure is more strongly
affected by the quality of the indicator variables. When EFA is truly
exploratory, it may be difficult to ensure that selected indicator
variables meet this criterion.

Fabrigar et al. (1999) concluded their review of the state of the
art with a set of recommendations that would improve on the typical
current practice. Principal factors analysis should replace PCA.
Oblique rotation, which permits correlated factors, should replace
orthogonal rotation. Parallel analysis or a direct measure of the fit
of the factor model to the data (e.g., root mean square error of
approximation [RMSEA]) should replace Kaiser's criterion. Applying
these methods to reanalyze data from three studies, Fabrigar et al.
(1999, p. 291) concluded that "... an EFA with an oblique rotation
provides much better simple structure, more interpretable results, and
more theoretically plausible representations of the data than a PCA
with an orthogonal rotation."

In  the  long  run,  getting  the  number  of  factors  right  in  EFA  is 
important,  but  not  essential.  Factors  that  are  the  product  of  chance 
associations  will  not  reproduce  in  other  samples.  When  somewhat 
similar  chance  factors  are  found  in  different  studies,  these  factors 
will  either  have  no  relationship  to  other  variables  or  will  show 
different  patterns  of  association  across  studies.  Either  outcome 
should  discourage  investigators  from  taking  those  factors  seriously 
when  constructing  theoretical  statements.  Extraneous  factors  will 
thus  be  weeded  out  in  cumulative  research  programs.  Nevertheless,  the 
research  process  is  made  more  efficient  when  supported  by 
intelligently focused analyses. Scientific progress is slowed by
researchers  who  adopt  the  complacent  view  that  the  evidence  will  "sort 
itself  out"  in  the  long  run. 

Multidimensional Scaling (MDS). MDS is another method of
establishing  the  dimensionality  of  a  measurement  space.  This  method 
can  be  applied  to  the  same  correlation  matrices  used  in  EFA.  However, 
the  interpretation  of  the  elements  of  the  matrix  is  different.  In 
EFA,  correlations  indicate  shared  causal  influences.  Viewing  EFA  as  a 
path  model,  the  correlations  between  indicator  variables  can  be 
estimated by multiplying the factor loadings (i.e., the path
coefficients; cf. Kenny, 1979). Factor analysis procedures produce
loadings  that  reproduce  the  observed  correlations  as  well  as  possible 
within  specific  constraints. 
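
The multiplication rule described above can be written out directly. The sketch below assumes standardized variables, so that the implied correlation matrix is the product of the loading matrix, the factor correlation matrix, and the transposed loading matrix, with ones on the diagonal. The loading values in the usage example are hypothetical.

```python
import numpy as np


def implied_correlations(loadings, factor_corr=None):
    """Correlations among indicators implied by a factor model.

    loadings    : (indicators x factors) array of factor loadings
    factor_corr : (factors x factors) factor correlation matrix;
                  defaults to the identity (orthogonal factors)
    """
    loadings = np.asarray(loadings, dtype=float)
    if factor_corr is None:
        factor_corr = np.eye(loadings.shape[1])
    implied = loadings @ np.asarray(factor_corr, dtype=float) @ loadings.T
    # Each standardized indicator correlates 1.0 with itself.
    np.fill_diagonal(implied, 1.0)
    return implied


def residual_correlations(observed, loadings, factor_corr=None):
    """Observed minus implied correlations; small values mean good fit."""
    implied = implied_correlations(loadings, factor_corr)
    return np.asarray(observed, dtype=float) - implied


if __name__ == "__main__":
    # Hypothetical one-factor model with three indicators.
    one_factor = np.array([[0.8], [0.7], [0.6]])
    print(implied_correlations(one_factor))
```

For a single factor, the implied correlation between two indicators is simply the product of their loadings (for example, .8 x .7 = .56).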


MDS  focuses  on  representing  similarity  between  cases  rather  than 
common  causal  influences.  MDS  solutions  define  locations  within  a 
reference  space.  Cases  that  have  similar  attributes  are  close 
together;  cases  that  are  dissimilar  are  farther  apart.  This  distance 
perspective  can  be  implemented  using  a  number  of  alternative  methods 
to  scale  distance.  No  matter  which  distance  measure  is  chosen,  the 
analysis  focuses  on  differences  between  the  dimensional  values  rather 
than  on  the  product  of  dimensional  values  (Davison,  1985;  Davison  & 
Sireci,  2000).  The  basis  for  setting  the  number  of  dimensions  is  the 
variance  (in  the  observed  distances)  that  is  explained  by  adding 
another dimension to the model. The model is complete when the
addition  of  another  dimension  would  not  substantially  increase  the 
variance  explained.  Compared  with  EFA,  MDS  typically  requires  fewer 
dimensions  to  describe  relationships  among  entities  (Davison,  1985). 
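
A minimal sketch of the distance-based approach follows. It substitutes classical (Torgerson) metric scaling for the range of MDS methods discussed above, and it uses the cumulative share of the positive eigenvalues as a stand-in for the variance explained by each added dimension. Both choices are assumptions made for illustration. When the input is a correlation matrix, one common conversion to distances is sqrt(2(1 - r)); that conversion is also an assumption, not a recommendation from the sources cited here.

```python
import numpy as np


def classical_mds(distances, n_dims):
    """Classical metric MDS of a symmetric distance matrix.

    Returns (coords, explained), where coords holds the location of
    each case on n_dims dimensions and explained holds the cumulative
    proportion of the positive eigenvalue total accounted for after
    each dimension is added.
    """
    squared = np.asarray(distances, dtype=float) ** 2
    n = squared.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double centering converts squared distances to scalar products.
    scalar_products = -0.5 * centering @ squared @ centering
    eigvals, eigvecs = np.linalg.eigh(scalar_products)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = np.clip(eigvals, 0.0, None)
    coords = eigvecs[:, :n_dims] * np.sqrt(positive[:n_dims])
    explained = np.cumsum(positive) / positive.sum()
    return coords, explained[:n_dims]


if __name__ == "__main__":
    # Hypothetical cases that truly lie in a two-dimensional space.
    points = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 0.5]], dtype=float)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    coords, explained = classical_mds(dist, n_dims=3)
    print(explained)
```

In the usage example the explained proportion stops increasing after the second dimension, which is the stopping signal described in the paragraph above.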

Confirmatory Factor Analysis (CFA). EFA can be carried out with
little or no knowledge of the likely structure of the data to be
analyzed. Advance specification of the number of dimensions is
optional in EFA. EFA computes loadings for all items on all
dimensions. In EFA, all dimensions are either orthogonal
(uncorrelated) or oblique (correlated).

CFA places a greater burden on the investigator. The number and
nature of dimensions in the model must be specified in advance. CFA
requires at least an informed guess to support each of the three basic
factor analysis decisions. The investigator must specify the number
of latent traits to measure, designate which indicator variables
define each latent trait, and specify a pattern of correlations
between latent traits. The pattern can include a mixture of
orthogonal and oblique dimensions. The model can be specified in
greater detail by assigning specific values to factor loadings and
factor correlations (cf., for example, Arbuckle & Wothke, 1999;
Jöreskog & Sörbom, 1981).

Figure  1  illustrates  a  CFA  model  that  consists  of  hypothetical 
negative  affect  and  extraversion  constructs.  Ovals  represent  the 
hypothesized  latent  traits.  Rectangles  represent  measured  variables. 
Arrows indicate hypothesized causal effects of latent traits on
measured variables. The numbers next to the arrows are the equivalent
of  CFA  factor  loadings  and  indicate  the  estimated  strength  of  the 
causal  effect.  The  two-headed  arrow  indicates  a  correlation  between 
latent  traits.  The  number  associated  with  the  arc  indicates  that  the 
correlation  is  moderate  in  magnitude. 

In  the  past,  CFA  required  the  analyst  to  define  parameter 
patterns  for  a  number  of  matrices.  Today,  CFA  is  much  more 
accessible.  This  is  underscored  by  the  fact  that  the  model  in  Figure  1 
was  constructed  simply  by  drawing  the  picture  and  then  linking  the 
picture's  components  to  variables  in  the  database.  The  figure  was 
constructed using two different computer packages (LISREL and Amos) to
verify that both commercial packages share this simple method of model
construction. Simplified methods have certainly supported the more
frequent use of CFA in behavioral research. To make good use of this
method, the underlying measurement issues must be clearly understood.

Figure 1 illustrates several points. First, each measured
variable receives an arrow from just one of the two latent traits. The
other latent trait might exert an effect on each of the indicators,
but these possible effects have been fixed explicitly at zero. This
is an example of how CFA imposes a model constraint. If the same
model were developed using EFA, the omitted arrows would appear as
potential causal paths. EFA would estimate a model in which all
possible arrows from latent traits to measured variables are included.

Figure 1. A CFA Model for Two Personality Traits


The second point illustrated by Figure 1 concerns the correlation
between latent traits. This correlation is moderately large. An
alternative model could have been defined with the correlation fixed
at zero. The model in Figure 1 corresponds to an oblique rotation in
EFA; the alternative model would treat negative affect and
extraversion as orthogonal factors. In this case, CFA removed a
constraint that is imposed by the typical default EFA. CFA can test
the plausibility of this constraint by comparing the goodness of fit
of the orthogonal and oblique models. Methods for this comparison are
described in the section on model appraisal and amendment.
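
The logic of comparing an orthogonal model with an oblique model can be sketched with the maximum likelihood discrepancy function, F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p, where S is the sample matrix, Sigma is the model-implied matrix, and p is the number of measured variables. The example is deliberately simplified: all loadings and the factor correlation are fixed at hypothetical values rather than estimated, so it illustrates only how misfit is quantified, not how a CFA program searches for estimates.

```python
import numpy as np


def implied_covariance(loadings, factor_corr, uniquenesses):
    """Model-implied matrix: Lambda Phi Lambda' + diagonal uniquenesses."""
    return loadings @ factor_corr @ loadings.T + np.diag(uniquenesses)


def ml_discrepancy(sample_cov, model_cov):
    """ML fit function; zero when the model reproduces the sample matrix."""
    p = sample_cov.shape[0]
    _, logdet_model = np.linalg.slogdet(model_cov)
    _, logdet_sample = np.linalg.slogdet(sample_cov)
    trace_term = np.trace(sample_cov @ np.linalg.inv(model_cov))
    return logdet_model + trace_term - logdet_sample - p


# Hypothetical values: three indicators per trait, no cross-loadings.
LOADINGS = np.array([
    [0.8, 0.0], [0.7, 0.0], [0.6, 0.0],   # negative affect indicators
    [0.0, 0.8], [0.0, 0.7], [0.0, 0.6],   # extraversion indicators
])
UNIQUENESSES = 1.0 - (LOADINGS ** 2).sum(axis=1)
OBLIQUE_PHI = np.array([[1.0, -0.4], [-0.4, 1.0]])   # assumed correlation
ORTHOGONAL_PHI = np.eye(2)                           # correlation fixed at 0


if __name__ == "__main__":
    # Treat the oblique model's matrix as if it were the observed data.
    sample = implied_covariance(LOADINGS, OBLIQUE_PHI, UNIQUENESSES)
    orthogonal = implied_covariance(LOADINGS, ORTHOGONAL_PHI, UNIQUENESSES)
    n_cases = 300  # hypothetical sample size
    f_orthogonal = ml_discrepancy(sample, orthogonal)
    # (N - 1) * F is the usual chi-square-type test statistic.
    print("F:", f_orthogonal, "statistic:", (n_cases - 1) * f_orthogonal)
```

When the latent traits really are correlated, fixing the correlation at zero produces a positive discrepancy, and the statistic grows with sample size. In an actual analysis both models would be estimated and their fit statistics compared.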


The  correlation  between  latent  traits  also  can  be  used  to 
illustrate  the  flexibility  of  CFA.  The  correlation  in  Figure  1  could 
be  replaced  by  a  causal  effect.  For  example,  an  arrow  from  negative 
affect  to  extraversion  would  indicate  a  model  in  which  negative  affect 
caused  people  to  be  less  willing  to  interact  with  others.  Routine  use 
of  EFA  would  make  the  factors  orthogonal  (i.e.,  zero  correlation).  The 
correlation  between  latent  traits  could  be  introduced  in  EFA  by  using 
an  oblique  rotation,  but  EFA  cannot  impose  a  constraint  that  one 
latent  trait  is  the  cause  of  another.  Causal  interplay  among 
personality  variables  is  not  a  common  topic  of  investigation,  but  it 
might  be  useful  in  some  areas.  For  example,  patterns  of  perception  and 
behavior that typify neurotic styles may have an underlying causal
pattern  (Shapiro,  1965). 

A third point of interest regarding Figure 1 requires some
amplification. The model is incomplete. A complete model would
include a disturbance for each measured variable. A disturbance term
is needed to reflect the fact that no measured variable is perfectly
correlated with the associated latent trait. Imperfect relationships
are expected, given that the measurement process generates some random
error variance. The disturbance term combines random error and
systematic variance from variables that are not in the model (James,
Mulaik, & Brett, 1982).

CFA obviously requires more thought and effort than EFA. The
payoff is greater flexibility in model construction. CFA provides
stronger tests of models. CFA programs estimate parameters (i.e.,
factor loadings, factor correlations) subject to the specific set of
constraints imposed by the investigator. If the resulting model
accurately reproduces the observed pattern of covariation between
indicator variables, the model has passed a riskier test than that
embodied by EFA. The test is riskier because EFA can adapt the
number of dimensions, the pattern of factor loadings, and the pattern
of factor correlations to the data. By contrast, the advance
specification of these attributes of the model required in CFA
increases the chance that the model will fail to reproduce the
observed pattern of covariation. Thus, CFA is a relatively more
demanding test. If the hypothesized model does account for the data,
the result should be interpreted as stronger support (Meehl, 1990a).

CFA also provides tools for directly comparing alternative
theories. This aspect of CFA is relevant when theories differ with
regard to the range of behaviors relevant to different constructs in
the model. In such cases, different theories will specify different
numbers of dimensions and/or patterns of factor loadings for a given
set of indicator variables. These differences specify different CFA
models. The competing models can then be fitted to the data to
determine which one does the best job of reproducing the observed
patterns of correlation or covariance.

Models developed to deal with specific measurement issues
illustrate the flexibility and power of CFA. CFA models have been
developed to systematically quantify Campbell and Fiske's (1959)
conceptualization of convergent and discriminant validity (Lance,
Noble, & Scullen, 2002; Marsh, 1989; Marsh & Bailey, 1991; Widaman,
1985). CFA can test circumplex models (e.g., Rounds & Tracey, 1993).
CFA can test hypotheses derived from Lord and Novick's (1968)
parallel/tau-equivalent/congeneric test formulation (e.g., Millsap &
Everson, 1991).


CFA also provides methods of addressing other recurring themes in
the literature. The generality of measurement models has been a
longstanding concern (Blalock, 1982). CFA can evaluate generality
within a single study by fitting one model to the data gathered from
two or more groups (e.g., males and females). Separate models then can
be fitted for each group. If fitting separate models to each group
does not substantially improve the overall fit, the first model
applies to all groups. CFA can also be used to test the generality of
previously published models even though the raw data from the original
study are not available for analysis. Factor loadings from a prior
analysis can be used to define the model to be fitted to data from a
new sample. Good fit is analogous to cross-validating a regression
equation (Browne, 2000).

CFA can justify the use of standard psychometric models (e.g.,
internal consistency estimates of reliability). These models apply
when indicator variables are effects of the underlying trait; these
models do not apply when the indicators are causes of the trait
(Bollen & Lennox, 1991; Edwards & Bagozzi, 2000). Effect indicators
are correlated because they share the latent trait as a common cause.
Bollen and Ting (2000) provide a method of determining whether
potential indicators are cause, effect, or a mixture of the two. This
procedure  may  help  clarify  the  structure  of  arguments  over  the  proper 
interpretation  of  a  scale. 
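
As one concrete instance of an internal consistency estimate, the sketch below computes coefficient alpha from an items-by-cases score matrix. Alpha is used here only as a familiar example of the class of estimates mentioned above; the text does not single it out, and the function name is hypothetical.

```python
import numpy as np


def cronbach_alpha(items):
    """Coefficient alpha for a (cases x items) array of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total)
    The estimate is meaningful only if the items are effect indicators,
    that is, if they covary because they share a common latent cause.
    """
    items = np.asarray(items, dtype=float)
    n_items = items.shape[1]
    item_variance_sum = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - item_variance_sum / total_variance)


if __name__ == "__main__":
    # Simulated effect indicators: one trait causes four item scores.
    rng = np.random.default_rng(11)
    trait = rng.normal(size=(500, 1))
    scores = 0.7 * trait + rng.normal(scale=0.7, size=(500, 4))
    print(cronbach_alpha(scores))
```

Items that are causes of a composite need not correlate with one another at all, in which case a low alpha says nothing about the quality of the scale.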

Factor  Analysis  and  the  Accumulation  of  Knowledge.  Reliable 
knowledge  should  be  cumulative.  From  this  perspective,  EFA  and  CFA 
can  play  complementary  roles.  EFA  is  most  useful  when  little  is  known 
about  a  measurement  domain.  Early  EFA  studies  provide  information 
that  can  be  used  to  formulate  theoretical  statements  that  are  the 
basis  for  later  CFA  studies.  CFA  studies  can  be  used  to  develop  and 
evaluate  a  sequence  of  models.  The  sequence  can  begin  with  relatively 
unconstrained  models  and  move  toward  models  that  specify  factor 
loadings  and  factor  correlations. 


Differences  in  Kind 

Methods  of  assessing  differences  in  kind  have  received  relatively 
less  attention  than  methods  of  assessing  differences  in  degree.  The 
greater  emphasis  on  dimensional  constructs  is  not  necessarily  an 
indication  that  these  constructs  are  more  important  than  categorical 
constructs. Meehl (1992) points out that whether a construct is
dimensional or categorical is not a matter of choice or preference.
The appropriate characterization is an empirical issue. From this
perspective, the preference for dimensional measurements may be a form
of bias.


Theoretical  discussions  that  invoke  typological  language  do  not 
always  indicate  that  a  typological  model  is  appropriate  or  intended. 
Sometimes  behavioral  researchers  use  typological  language  to  simplify 
their  communication.  For  example,  psychologists  may  contrast 
extraverts  and  introverts  to  illustrate  the  extremes  of  a  personality 
dimension.  In  other  cases,  typological  language  refers  to  real 
categorical entities. Psychiatric classification is probably the best
known example of this usage, and illustrates the potential for tension
between typological and dimensional models. Some theorists argue for
[...] as well. These arguments echo Meehl's (1992) assertion that the
choice between alternatives is an empirical issue. Because categorical
language should not always be taken at face value, there is a clear
need for sound methods of defining categorical variables in empirical
terms. Good methods are absolutely necessary for resolving measurement
issues in some important behavioral domains.

Military researchers may be interested in typological variables in
several research contexts. Screening programs provide one obvious
example. It can be wasteful and potentially harmful to [...] recruits
who have personality disorders. Such individuals may leave the
service [...].

The exploratory-confirmatory continuum used above to contrast EFA
and CFA can also be applied to differences in kind. The exploratory
end is represented by traditional methods of cluster analysis. These
methods have been applied with few restrictions on the structure of
the resulting typology. Recent advances provide methods of imposing
restrictions based on prior research. In addition, taxometric
techniques provide alternative methods that focus on the [...]
specific types, such as those [...].

Cluster Analysis. Cluster analysis is the most widely used method
of defining categorical variables empirically. Cluster analysis
[...] of cases (e.g., people, [...]). The problem is to define a set
of groups that satisfies two criteria. First, the cases assigned to
the same group must have similar profiles. Second, the profile for
the average case must differ substantially between groups. The first
criterion [...] groups with very similar, but not identical, profiles.

Cluster analysis requires choosing a clustering method (e.g.,
hierarchical agglomerative or direct), a similarity index (e.g., [...]
and distance with or without standardizing indicator scores), and a
rule for determining group membership (e.g., average linkage or
nearest neighbor linkage). In this respect, cluster analysis is
similar to factor analysis [...].
Different choices may lead to different results, so researchers must
recognize  the  implications  of  the  choices  they  make  or  read  about  in 
research  reports.  Gore  (2000)  provides  an  overview  of  cluster 
analysis.  Aldenderfer  and  Blashfield  (1985)  provide  a  brief 
introductory  text.  Everitt,  Landau,  and  Leese  (2001)  provide  more 
detailed  guidance,  including  advice  about  computer  programs. 
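
The point that analytic choices can change the outcome can be illustrated with a small sketch. The example below is only an illustration under stated assumptions: the four cases, their two indicators (age and income), and the helper names `standardize` and `closest_pair` are all hypothetical and do not come from the studies cited above. It shows that the pair of cases judged most similar by Euclidean distance can differ depending on whether indicator scores are standardized first.

```python
import math
import statistics

# Hypothetical profiles (age in years, income in dollars); not data from the text.
cases = {
    "A": (25.0, 30000.0),
    "B": (60.0, 30200.0),
    "C": (26.0, 31500.0),
    "D": (58.0, 34000.0),
}


def euclidean(p, q):
    """Euclidean distance between two score profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def standardize(data):
    """Convert each indicator to z scores (population standard deviation)."""
    columns = list(zip(*data.values()))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return {
        name: tuple((v - m) / s for v, m, s in zip(profile, means, sds))
        for name, profile in data.items()
    }


def closest_pair(data):
    """Return the pair of cases with the smallest Euclidean distance."""
    names = sorted(data)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda pair: euclidean(data[pair[0]], data[pair[1]]))


raw_pair = closest_pair(cases)               # income dominates raw distances
std_pair = closest_pair(standardize(cases))  # both indicators weigh equally
print(raw_pair, std_pair)
```

With these made-up values, the raw scores pair A with B (similar incomes), while the standardized scores pair A with C (similar ages), so an agglomerative procedure would begin by merging different cases in the two analyses.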

The results of exploratory cluster analyses (ECAs) must be viewed
with skepticism. These procedures always produce clusters. The
problem is not to decide whether there are clusters, but rather to
determine how many empirical clusters represent real and distinct
classes in the population. When Milligan and Cooper (1986) reviewed
indices designed to help make this decision, the Hubert and Arabie
(1985) adjustment to the Rand (1971) index was determined to be the
best option. However, no single index is accepted today as the best
option for determining the number of clusters (Gore, 2000). One
[...] rules [...] available in standard computer packages.

The indices for establishing the number of clusters are based on
the analysis of a single set of data. Replication of different cluster
solutions across samples is another viable method. This method can be
applied even in a single set of observations by dividing the set into
subsets (Overall & Magee, 1992), but results must be interpreted
carefully (Krieger & Green, 1999).
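
One way to quantify how well two cluster solutions agree, for example when the same cases are classified by rules derived from two subsamples, is the adjusted Rand index mentioned above. The sketch below is a minimal implementation of the standard formula (agreement on pairs of cases, corrected for chance); the function name and the example labels are assumptions made for illustration, and the code is not taken from any of the cited sources.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same cases."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must classify the same cases")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two cases are required")

    # Pairs of cases placed together in both partitions.
    together_in_both = sum(
        comb(count, 2) for count in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together within each partition separately.
    together_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    together_b = sum(comb(count, 2) for count in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. every case in a single cluster in both solutions.
        return 1.0
    return (together_in_both - expected) / (maximum - expected)


# Cluster labels are arbitrary, so a relabeled but identical solution scores 1.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

Values near 1 indicate that the two solutions group the cases in nearly the same way, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance would produce.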

Empirically defined cluster solutions require further study to
[...]. No matter which rule is applied, there is no guarantee that the
resulting clusters will have any real meaning. The initial [...]
definition can be a starting point for subsequent validation of the
typology, but construct validation is a necessity before the typology
is accepted as meaningful.

Recent developments provide a stronger approach to cluster
analysis (Fraley & Raftery, 2002). Newer procedures are similar to
CFA in structure because they provide the investigator the opportunity
to constrain the cluster solution.

Recent developments in cluster analysis extend Wolfe's (1970)
early work on mixtures of normal distributions. Today the
investigator can specify the structure of a group by defining the
means, variances, and covariances for the indicators. The number of
group structures specified determines the number of clusters in the
model. Individual cases are assigned to clusters by computing the
probability of membership in each of the hypothesized groups. The
expectation-maximization (EM) algorithm (cf. McLachlan & Krishnan,
1997) is applied to compute these probabilities. The goodness of fit
of the model is indicated by $\chi^2$ statistics produced by the EM
algorithm, so significance tests are an option for determining the
number of groups.

The EM approach to cluster analysis is a clustering analogue of
CFA. Older procedures relied on mathematical criteria to define the
groups and their boundaries. EM uses parameter values established by
the analyst to define the groups. Those parameter values can be based
on theory or past research. Constraints can be imposed to explicitly
test alternative hypotheses about the structure of a typology. In the
future, a distinction between ECA and confirmatory cluster analysis
may become as common as the distinction between EFA and CFA.
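
The membership probabilities described above can be sketched for the simplest case. The example below assumes a single indicator and two hypothesized groups whose mixing proportions, means, and standard deviations are fixed in advance; all numerical values and function names are hypothetical. It carries out only the probability computation (the expectation step); a full EM analysis would alternate this step with re-estimation of any parameters left free, and would normally involve several indicators with covariances.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def membership_probabilities(x, groups):
    """Posterior probability that a case with score x belongs to each group.

    groups is a list of (mixing_proportion, mean, sd) tuples specified a priori.
    """
    weighted = [p * normal_pdf(x, mean, sd) for p, mean, sd in groups]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical a priori specification: 80% of cases in a group centered at 0,
# 20% in a group centered at 3, both with unit standard deviation.
groups = [(0.8, 0.0, 1.0), (0.2, 3.0, 1.0)]

for score in (0.0, 1.5, 3.0):
    print(score, membership_probabilities(score, groups))
```

A case midway between the two means (1.5) is equally well described by both group densities, so its membership probabilities simply reproduce the specified mixing proportions (0.8 and 0.2); cases near either mean are assigned to that group with high probability.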


Latent Class Analysis (LCA). LCA also defines typologies and is
used when the indicator variables being analyzed are categorical. In
such cases, the data define a matrix with many different cells. Each
cell represents one of the possible score profiles for the cases in
the sample. LCA capitalizes on the fact that cases are not likely to
be distributed evenly across possible cells. Classes are subsets of
the data set that represent particular score profiles. The [...]
by a subset of the indicators.

As an example of the LCA problem, consider a disease that
produces eight symptoms. If a questionnaire were constructed to
determine who had the disease, each person who completed the
questionnaire would respond "Yes" for each symptom that was present
and "No" for each symptom that was absent. Taken together, the eight
questions define 256 possible symptom combinations. However, suppose
that four disease symptoms were physical and four were
psychological. Some people might report suffering no symptoms at all,
some only physical symptoms, some only psychological symptoms, and
some might report having all eight (physical and psychological)
symptoms. If these were the only response patterns that were
observed, a four-group classification would be a reasonable
representation of the data. The analysis problem would be to identify
these four latent classes of respondents.
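
The arithmetic of this example can be checked with a short sketch. The code below enumerates every possible Yes/No profile for eight questions and picks out the four idealized profiles described above. It assumes, for illustration only, that the first four questions are the physical symptoms and that "only physical" means all four physical symptoms and no psychological ones (and likewise for "only psychological"); the function name is hypothetical.

```python
from itertools import product

N_PHYSICAL = 4
N_PSYCHOLOGICAL = 4

# Every possible Yes (1) / No (0) response profile for the eight questions.
patterns = list(product((0, 1), repeat=N_PHYSICAL + N_PSYCHOLOGICAL))


def idealized_class(pattern):
    """Label a profile with one of the four idealized classes, or None."""
    physical = pattern[:N_PHYSICAL]
    psychological = pattern[N_PHYSICAL:]
    if not any(pattern):
        return "no symptoms"
    if all(physical) and not any(psychological):
        return "physical only"
    if all(psychological) and not any(physical):
        return "psychological only"
    if all(pattern):
        return "all eight"
    return None  # a profile the four-class description does not anticipate


expected = [p for p in patterns if idealized_class(p) is not None]
print(len(patterns), len(expected))  # prints: 256 4
```

In real data some of the remaining 252 cells would be occupied because of response error or partial symptom patterns, which is why LCA estimates class membership probabilistically rather than by exact matching.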

LCA tests for the existence of clustering in situations such as
that described above. McCutcheon (1987) described the basic
procedures and noted explicitly that LCA categories do not necessarily
[...] a dimension. Conditional independence is fundamental to LCA.
Independence is indicated by the absence of a correlation between
indicators. Conditional independence has two elements. Independence
is conditional if the indicators are uncorrelated within classes but
are correlated in the overall sample. LCA groups observations such
that the resulting classification approaches conditional independence
as closely as possible. [...] within-class values [...] how closely
this objective has been approximated.
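
Conditional independence can be made concrete with a small numerical sketch. The example below assumes two latent classes of equal size and two Yes/No indicators whose endorsement probabilities are hypothetical values chosen for illustration. Within each class the indicators are independent by construction, yet pooling the classes produces a positive covariance in the overall sample.

```python
def joint_distribution(class_sizes, item_probs):
    """Overall P(x1, x2) for two binary indicators that are independent within class.

    class_sizes: proportion of the sample in each class (sums to 1).
    item_probs: for each class, (P(x1 = 1), P(x2 = 1)).
    """
    joint = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for size, (p1, p2) in zip(class_sizes, item_probs):
        for a in (0, 1):
            for b in (0, 1):
                prob_a = p1 if a else 1.0 - p1
                prob_b = p2 if b else 1.0 - p2
                joint[(a, b)] += size * prob_a * prob_b
    return joint


def covariance(joint):
    """Covariance of the two indicators implied by a joint distribution."""
    mean_1 = joint[(1, 0)] + joint[(1, 1)]
    mean_2 = joint[(0, 1)] + joint[(1, 1)]
    return joint[(1, 1)] - mean_1 * mean_2


# Hypothetical classes: one endorses both items often, the other rarely.
within_class = covariance(joint_distribution([1.0], [(0.9, 0.9)]))
overall = covariance(joint_distribution([0.5, 0.5], [(0.9, 0.9), (0.1, 0.1)]))
print(within_class, overall)
```

The within-class covariance is zero (up to rounding), while the pooled covariance is 0.16; the association in the overall sample is produced entirely by the difference between classes, which is the pattern LCA looks for.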




Two points should be kept in mind concerning LCA. First, a [...]
test for choosing the appropriate number of classes is discussed
later in this chapter (see Model Evaluation). This criterion may tend
to produce too many groups when large samples are studied. With large
samples, even small changes in
the  absolute  fit  of  the  model  to  the  data  will  be  statistically 
significant.  Second,  LCA  suffers  from  indeterminacy  problems. 
McCutcheon  (1987,  p.  25)  notes  that  multiple  solutions  may  provide 
equivalent  fit  to  the  data.  Recent  developments  extend  traditional 
LCA.  Magidson  and  Vermunt  (2002)  distinguish  LCA  from  latent  class 
factor analysis (LCFA). LCFA defines categorical equivalents of EFA
factors.  Specifically,  each  LCFA  dimension  defines  a  categorical 
variable  with  two  or  more  levels.  Magidson  and  Vermunt  (2001) 
demonstrate  that  the  LCFA  models  can  be  more  parsimonious  than  LCA 
models  when  considered  at  the  level  of  the  overall  model.  Thus,  as  a 
simple  example,  three  parameters  are  needed  to  represent  four  groups. 
Only  two  parameters  are  needed  to  define  the  same  four  groups  based  on 
two dichotomous LCFs. Greater parsimony is indicated by the fact that
fewer parameters are needed to correctly classify individuals (Popper,
1959).


Magidson  and  Vermunt  (2002)  have  also  shown  that  in  at  least  some 
instances,  LCA  can  classify  cases  more  accurately  than  K-means 
clustering,  which  is  the  more  common  direct  partitioning  method  of 
cluster  analysis.  In  discussing  their  findings,  Magidson  and  Vermunt 
(2002) note two important advantages of LCA over K-means analysis.
The first is that classification involves computing the probability
that  each  case  is  a  member  of  each  group.  This  probability  then  is 
used  to  weight  the  case  when  computing  the  group  centroid,  which 
avoids  biased  estimates  that  may  be  derived  by  the  K-means  approach  of 
weighting  all  cases  equally.  Second,  the  LCA  approach  provides 
diagnostic  information  that  can  be  used  to  determine  the  number  of 
clusters. These $\chi^2$ statistics can be used to construct GFIs. These
indices,  which  are  discussed  in  the  Model  Appraisal  section  of  this 
chapter,  provide  a  means  of  judging  the  accuracy  with  which  a  given 
model  reproduces  the  observed  distribution  of  cases  across  the  cells 
in  the  cross-classification. 
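
The difference between the two weighting schemes can be shown with a toy calculation. The sketch below is a simplified illustration, not the procedure of Magidson and Vermunt: it assumes one indicator, made-up scores, and made-up membership probabilities for a single group, and it stands in for hard assignment by counting a case as a full member whenever its probability is at least 0.5.

```python
def hard_centroid(scores, probs, threshold=0.5):
    """Mean of the cases assigned outright to the group (equal weights)."""
    members = [x for x, p in zip(scores, probs) if p >= threshold]
    if not members:
        raise ValueError("no case meets the assignment threshold")
    return sum(members) / len(members)


def weighted_centroid(scores, probs):
    """Mean of all cases, each weighted by its membership probability."""
    total_weight = sum(probs)
    if total_weight == 0:
        raise ValueError("membership probabilities sum to zero")
    return sum(p * x for x, p in zip(scores, probs)) / total_weight


# Hypothetical indicator scores and probabilities of belonging to one group.
scores = [0.0, 1.0, 2.0, 3.0]
probs = [0.1, 0.4, 0.6, 0.9]

print(hard_centroid(scores, probs), weighted_centroid(scores, probs))
```

With these values the hard-assignment centroid is 2.5, whereas the probability-weighted centroid is 2.15, because cases with uncertain membership still contribute partially to the estimate.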

Taxometrics.  Taxometric  procedures  (Meehl,  1992,  1995;  Waller  & 
Meehl,  1998)  are  another  method  of  testing  the  claim  that  distinct 
groups  exist  within  a  population.  This  method  focuses  explicitly  on  a 
two-group  solution.  The  hypothesized  groups  are  a  target  group  to  be 
identified  (i.e.,  taxon)  and  all  others.  Indicators  of  taxon  status 
are  assumed  to  be  uncorrelated  within  each  group  (i.e.,  locally 
independent). Between-group differences produce correlations between
the  indicators  in  the  general  population.  Taxometric  procedures  test 
for  the  existence  of  the  hypothesized  taxon  and  the  base  rates  for 
that  group  and  its  complement. 
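
The link between group differences and indicator correlations can be written out explicitly. The decomposition below is the standard expression for the covariance of two indicators in a mixture of two groups; the symbols are introduced here for illustration ($P$ is the taxon base rate, $Q = 1 - P$ is the complement base rate, and subscripts $t$ and $c$ denote the taxon and complement groups).

```latex
\begin{aligned}
\operatorname{cov}(x, y)
  &= P \,\operatorname{cov}_t(x, y) + Q \,\operatorname{cov}_c(x, y)
     + P Q \,(\mu_{x,t} - \mu_{x,c})(\mu_{y,t} - \mu_{y,c}) \\
  &= P Q \,(\mu_{x,t} - \mu_{x,c})(\mu_{y,t} - \mu_{y,c})
     \quad \text{when } \operatorname{cov}_t(x, y) = \operatorname{cov}_c(x, y) = 0 .
\end{aligned}
```

Under local independence, then, the covariance observed in the general population depends only on the base rates and on the separation of the group means on each indicator. It vanishes when there is no taxon ($P = 0$ or $Q = 0$) or when the groups do not differ on one of the indicators.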

The  taxometric  approach  was  developed  initially  in  the  context  of 
a  model  of  schizotypy,  but  this  technique  subsequently  has  been 
applied  to  other  concepts  such  as  "Type  A"  behavior  (Strube,  1989).  A 
recent  simulation  by  Beauchaine  and  Beauchaine  (2002)  compared 
taxometric  procedures  to  K-means  cluster  analysis.  Taxometrics, 
specifically  the  maximum  covariance  procedure,  was  more  effective 
"...when  the  number  of  indicators  was  few,  when  effect  sizes  were 
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reduced,  when  nuisance  correlations  were  high,  and  when  base  rates 
were  low’’  (p .  256).  This  comparison  is  limited  by  the  use  of  a  single 
taxometric  procedure;  convergence  of  several  methods  is  preferred 
(  eehl,  1992,  1995).  On  the  other  hand,  the  use  of  EM-based 
clustering  methods  (Fraley  &  Raftery,  2002)  also  might  have  changed 

T^1S  possiblllty  is  supported  by  a  recent  simulation 
study  that  found  taxometric  procedures  and  latent  mixture  modeling 

20nnr°rHtO  direc\ cluster  analysis  (Cleland,  Rothschild,  &  Haslam, 
2000).  However,  the  use  of  a  poor  criterion  for  deciding  the  number 
of  clusters  may  have  influenced  these  results. 

Typological Analysis and the Accumulation of Knowledge.
Cumulative knowledge is evident when a research domain moves from
exploratory analysis to confirmatory analysis. Recent developments in
typological analysis procedures provide tools to move the
investigation of "differences in kind" toward confirmatory modeling.
Traditional cluster analysis and LCA procedures are largely
exploratory since they provide only limited opportunities for the data
analyst to constrain solutions. Both EM cluster analysis and [...]
provide greater opportunities for using prior knowledge to
constrain the models. LCFA analyses are primarily exploratory, but
they are noteworthy because they provide alternative models that can
be contrasted with LCA results.

The previous comments on factor analysis and knowledge
accumulation apply to the construction of typological measurement
models. Confirmatory models require parameter specifications.
Confirmatory models represent cumulative knowledge when the parameter
values have been derived from prior research. A priori parameter
specification implies the same risk found in confirmatory tests of
dimensional models. Results consistent with the model in spite of the
risk provide stronger support for the model than the relatively
qualitative evaluations provided by exploratory methods. The strategy
of combining confirmatory methods and exploratory methods is good
practice. This approach combines consistency (i.e., similarity to
prior findings) testing with an exploratory search for better
alternatives.

Measurement  MAGIC 


The preceding discussion of methods of developing measurement
models has emphasized movement from exploratory to confirmatory
methodologies. Movement from an initial unstructured exploratory
analysis to a final, completely specified confirmatory model should be
accompanied by a general increase in Abelson's (1995) MAGIC criteria.
Each element of MAGIC is briefly considered here.

The magnitude component of MAGIC conceivably could decline as a
field progresses from early exploratory measurement models to final
confirmatory models. Early models can account for more of the
variation in covariation or similarity measures by capitalizing on
chance. The a priori specification of parameter values eliminates the
opportunity to incorporate chance into final models. Final models can
be  expected  to  exclude  minor  systematic  sources  of  covariance  and 
similarity  (MacCallum,  2003).  A  final  model  will  be  acceptable  if  it 
captures  the  major  systematic  sources  of  variation  in  the  indicator 
variables  even  if  the  overall  explanatory  power  of  this  model  is  less 
than  the  apparent  explanatory  power  that  exploratory  models  achieve  by 
capitalizing  on  chance. 


The  articulation  of  measurement  models  clearly  increases  with  the 
progression  from  exploratory  models  to  confirmatory  models.  A  graph 
such  as  that  in  Figure  1  provides  one  way  of  thinking  about  model 
articulation.  The  model  in  Figure  1  is  relatively  simple  because  the 
number  of  latent  traits  is  restricted  and  because  some  potential 
trait-indicator  relationships  have  been  ruled  out.  This  simplification 
is  articulation  in  the  sense  that  it  specifies  particular  traits  and 
separates  potential  causal  paths  into  those  included  in  the  model  and 
those  that  either  do  not  exist  or  are  small  enough  to  be  ignored.  For 
example,  previous  research  by  Costa  and  McCrae  (1992)  and  others  could 
have  been  used  to  specify  the  factor  loadings  for  the  model  in  Figure 
1.  Had  this  been  done,  the  model  would  have  been  fully  specified  in 
advance  of  the  analysis.  The  fully  specified  model  would  have  provided 
complete  articulation.  Every  latent  trait,  every  causal  effect,  and 
every parameter value would have been specified a priori. If one
compared  a  graph  of  the  a  priori  specifications  for  EFA  and  CFA,  the  a 
priori  definition  for  Figure  1  in  EFA  would  consist  solely  of  the  list 
of  variables  at  the  sides  of  the  figure.  There  would  be  no  ovals  or 
arrows.  The  additional  specification  of  latent  traits  and  causal 
effects in Figure 1 is a pictorial manifestation of model
articulation.


Generality  may  or  may  not  increase  in  the  progression  from 
exploratory to confirmatory models. While generality is usually
assumed  for  new  constructs,  this  attribute  should  be  empirically 
established  in  the  process  of  developing  the  measurement  model 
(Blalock,  1982).  CFA  provides  tests  for  generality  that  may  extend  to 
categorical  analyses.  The  CFA  methods  for  assessing  generality  rely 
on  goodness  of  fit  evaluations  that  could  be  readily  adapted  to  the 
mixture  approach  to  cluster  analysis.  However,  tests  for  generality 
may  indicate  that  different  models  are  needed  for  different 
populations.  If  so,  models  should  reflect  that  fact  rather  than 
treating  different  groups  as  equivalent. 


Interest  may  be  a  constant  in  the  sequence.  Initial  results 
often  are  interesting  because  they  are  novel  and  shed  light  on 
previously  unknown  territory.  This  interest  should  be  tempered  by  an 
appreciation  of  the  possibility  that  the  findings  are  the  product  of 
chance.  This  interest  also  should  be  tempered  by  the  realization  that 
the results probably rest on assumptions that need to be tested.
After initial model development, the clarity with which an analysis
contrasts competing models may determine its interest value (Dixon &
O'Reilly, 1999). As models mature, verification that a final model
fits the data from a new sample should be interesting. However, at
this point, the strongest evidence might be the a priori prediction of
the  parameter  values  (i.e.,  the  pattern  of  factor  loadings)  for  a  new 
indicator.  This  a  priori  prediction  would  be  the  type  of  "darned 
strange  coincidence"  (Salmon,  1984)  that  provides  support  for  a 
theoretical  model.  Given  these  suggestions,  interest  in  a  field  will 
be  maintained  by  novel  predictions  and/or  progression  in  articulating 
and  contrasting  alternative  models. 

Credibility  should  be  greater  for  models  that  are  near  the 
confirmatory  end  of  the  model  development  sequence.  This  exploration 
should  consider  a  range  of  plausible  interpretations  and  alternative 
models  for  the  constructs  of  interest  and  for  mappings  of  indicators 
onto  constructs.  Credibility  is  enhanced  by  the  use  of  confirmatory 
techniques  to  determine  whether  the  indicators  are  effect  variables, 
to  rule  out  interpretations  based  on  methods  variance,  and  to 
demonstrate convergent and discriminant validity. The key to
credibility is the model's cumulative track record (Meehl, 1990a).
The final model should be the one that best accounts for the data from
multiple studies.


Path  Models 

Path  models  describe  relationships  among  theoretical  constructs 
(McDonald & Ho, 2002). The methods available to construct path models
include regression and analysis of variance (ANOVA). These procedures
represent a subset of the possible methods for constructing and
testing path models. The extensive range of alternatives available 20
years ago (Andrews, Klem, Davidson, O'Malley, & Rodgers, 1981) has
grown (Andrews et al., 1998). This section will consider a shift in
the  rationale  for  choosing  among  alternative  methods  and  will  then 
examine  structural  equation  modeling  (SEM)  and  hierarchical  linear 
modeling  (HLM)  as  procedures  that  are  likely  to  be  used  with 
increasing  frequency  in  the  future.  Latent  growth  curve  analysis 
(LGCA) and the analysis of CLDV are also considered as methods that
simplify  the  approach  to  key  modeling  issues.  These  methods  can  be 
implemented  in  the  context  of  SEM  or  HLM. 

Selecting  a  Modeling  Method 

Twenty  years  ago,  level  of  measurement  issues  might  have  been  a 
primary  consideration  when  selecting  an  analysis  procedure  (Andrews  et 
al.,  1981).  The  phrase  "level  of  measurement"  refers  to  the 
information  content  of  the  variables  used  in  analyses.  According  to 
Nunnally  and  Bernstein  (1994),  nominal  measures  assign  observations  to 
categories  that  have  no  intrinsic  ordering  (e.g.,  male,  female). 
Ordinal  variables  indicate  relative  magnitude  (e.g.,  greater  than, 
less  than)  for  an  attribute.  Interval  measures  indicate  order  and  the 
distance  between  observations.  Ratio  measures  order  observations, 
indicate  distance  between  observations,  and  indicate  distance  from  a 
zero value.

Level of measurement can be a guide to selecting analysis
procedures. For example, Pearson product moment correlations would be
computed to describe relationships between interval measures, but
Spearman's rank order correlation would be computed to describe
relationships between ordinal measures. This approach to choosing an
appropriate statistic can be extended to a wide range of analysis
problems. Doing so produces a complex decision tree with branches
that often culminate in little known statistics (Andrews et al.,
1981). Today, this approach is made easier by statistical software
routines  that  make  once  obscure  statistics  more  readily  available. 
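
The two coefficients named above can be compared directly. The sketch below implements both from their textbook definitions using only the standard library; the data are made up for illustration. Spearman's coefficient is computed as the Pearson correlation of the ranks, with tied scores given the average of the ranks they occupy.

```python
import math


def pearson(x, y):
    """Pearson product moment correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks from 1 to n, with tied values sharing the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank order correlation (Pearson correlation of the ranks)."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores with a perfectly monotonic but nonlinear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson(x, y), 4), spearman(x, y))
```

Because the relationship is monotonic, the rank order correlation is 1.0, while the product moment correlation is slightly lower (about 0.98) because the relationship is not linear.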

Current  trends  shift  the  emphasis  from  data  characteristics  to 
the  nature  of  the  construct  being  studied.  If  the  constructs  being 
studied  are  continuous  variables,  analyses  are  chosen  to  obtain  the 
best  estimate  of  the  population  correlation  that  can  be  derived  from 
the  data.  This  focus  is  maintained  regardless  of  the  level  of 
measurement.  For  example,  consider  the  problem  of  estimating  the 
relationship  between  two  continuous  constructs  when  one  has  been 
measured  with  an  ordinal  scale  and  the  other  with  an  interval  scale. 
The  level  of  measurement  might  lead  to  the  use  of  a  rank  order 
correlation  because  rank  order  information  is  the  least  common 
denominator  for  the  two  measures.  However,  if  both  constructs 
theoretically  are  differences  in  degree,  a  polyserial  correlation  is 
an  appropriate  estimate  of  the  true  population  correlation.  Some 
standard  analysis  packages  now  would  provide  this  theoretically 
appropriate estimate as a default (Joreskog & Sorbom, 1996). In
effect,  the  information  provided  by  the  scales  is  used  to  obtain  the 
best  possible  estimate  of  the  theoretically  relevant  population 
parameter  regardless  of  the  level  of  measurement. 

Structural  Equation  Modeling  (SEM) 

SEM  techniques  have  been  so  widely  applied  that  a  detailed  review 
of  their  uses  would  be  unmanageable  in  any  brief  format.  Bentler  and 
Dudgeon  (1996)  and  MacCallum  and  Austin  (2000)  provide  recent 
overviews  of  these  methods.  Investigators  who  are  just  beginning  to 
use  SEM  can  choose  from  a  growing  number  of  texts.  However,  the  choice 
should  be  made  carefully.  Many  introductory  texts  gloss  over  critical 
considerations  (Steiger,  2001).  Bollen  (1989)  provides  a  sound 
introduction  to  the  general  method.  The  current  chapter  will  limit 
consideration  of  SEM  to  specific  recent  developments  that  are 
particularly  relevant  to  the  range  and  content  of  models  constructed 
using  this  tool. 

Modeling Interactions. Interactions occur when the relationship
between two variables is contingent on one or more other variables.
For example, an investigator might be interested in determining
whether the relationship between general intelligence and job
performance was the same in different occupations. If the physical
demands of different occupations were known, the question might be
whether intelligence has less effect on performance in physically
demanding jobs.

The use of SEM to model interactions is analogous to more
familiar procedures. Analysis of covariance (ANCOVA) is one method of
investigating interactions. Nonparallel regression lines illustrate
the essence of an interaction. Consider a two-group analysis. In
this case, a significant result in a test for nonparallelism of
regression lines means $b_{11} \neq b_{12}$ ($b_{1j}$ is the slope of
the regression line in the $j$th group). Moderated regression
(Saunders, 1956) is another method. In this case,
$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$. The contingency in this
case can be illustrated by considering that the slope of the
regression for $x_1$ is $b_1$ when $x_2 = 0$, $b_1 + b_3$ when
$x_2 = 1$, and $b_1 + 2b_3$ when $x_2 = 2$.


Figure 2. Example of Moderated Regression Result (horizontal axis:
Agreeableness)


Figure 2 illustrates moderated regression. Gellatly and Irving
(2001)  tested  the  hypothesis  that  job  autonomy  moderates  the  effects 
of  personality  on  job  performance.  The  hypothesis  was  supported  for 
the  Agreeableness  scale  of  the  NEO-FFI  (Costa  &  McCrae,  1992).  Figure 
2  illustrates  the  resulting  interaction  by  plotting  the  regression  of 
performance  on  agreeableness  for  high,  medium,  and  low  autonomy.  An 
interaction is indicated by the fact that the lines are not parallel.
The lines would be parallel if performance were simply an additive
function  of  agreeableness  and  autonomy.  The  nonparallel  lines  mean 
that  the  relationship  between  personality  and  performance  depends  on 
the  level  of  autonomy.  This  contingency  indicates  an  interaction. 
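
The way a moderated regression equation produces nonparallel lines can be sketched in a few lines of code. The coefficients below are hypothetical values chosen for illustration; they are not the estimates reported in the study shown in Figure 2. The moderator is assumed to be scored -1, 0, and 1 for low, medium, and high levels.

```python
def predict(b0, b1, b2, b3, x1, x2):
    """Predicted criterion score from y = b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def simple_slope(b1, b3, x2):
    """Slope of the regression of y on x1 at a fixed moderator value x2."""
    return b1 + b3 * x2


# Hypothetical coefficients (illustration only).
b0, b1, b2, b3 = 2.0, 0.10, 0.30, 0.20

for label, level in (("low", -1.0), ("medium", 0.0), ("high", 1.0)):
    print(label, round(simple_slope(b1, b3, level), 2))
```

With these values the slope of the criterion on the predictor is -0.10, 0.10, and 0.30 at low, medium, and high levels of the moderator. If the product coefficient $b_3$ were zero, all three slopes would equal $b_1$ and the plotted lines would be parallel.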

The  moderated  regression  approach  was  not  always  accepted  as  a 
legitimate  method  of  modeling  interactions.  A  debate  on  the  use  of 
this  method  once  focused  heavily  on  how  to  provide  appropriate 
statistical  significance  tests.  Of  central  concern  was  the 
confounding of main effects with interaction effects (i.e.,
collinearity). Cohen (1978) facilitated widespread use of the method
by showing that analyses produced the same conclusion regarding the
presence of an interaction whether or not special steps (e.g.,
standardizing the variables) were taken to reduce this problem. In
this special case, a crucial element of the principled debate could be
resolved in purely mathematical terms. Such definitive argument
resolutions  are  rare  in  behavioral  research. 
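
One way to see why rescoring the predictors does not change the conclusion about the interaction is to substitute centered scores into the moderated regression equation. The algebra below is offered as an illustration under the assumption that each predictor is shifted by a constant, $x_1 = u_1 + m_1$ and $x_2 = u_2 + m_2$ (for example, $m_1$ and $m_2$ could be the sample means); it is not presented as the argument given in the source cited above.

```latex
\begin{aligned}
y &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2 \\
  &= b_0 + b_1 (u_1 + m_1) + b_2 (u_2 + m_2) + b_3 (u_1 + m_1)(u_2 + m_2) \\
  &= (b_0 + b_1 m_1 + b_2 m_2 + b_3 m_1 m_2)
     + (b_1 + b_3 m_2)\, u_1 + (b_2 + b_3 m_1)\, u_2 + b_3\, u_1 u_2 .
\end{aligned}
```

The intercept and the two lower-order coefficients change with the choice of $m_1$ and $m_2$, but the coefficient of the product term remains $b_3$. Dividing the predictors by their standard deviations as well only multiplies that coefficient and its standard error by the same constant, so the test of the interaction is unaffected.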

SEM  includes  analogues  to  both  the  ANCOVA  and  moderated 
regression  methods  of  modeling  interactions.  Subgroup  models  provide 
the  SEM  equivalent  of  the  ANCOVA  approach.  Subgroup  analysis  divides 
samples into groups based on some characteristic(s), such as
occupation  or  gender.  Two  versions  of  an  SEM  then  are  fitted  to  the 
data  in  each  group.  One  version  constrains  the  model  parameters  to  be 
equal  for  all  the  groups  in  the  analysis.  This  step  is  equivalent  to 
conducting  ANCOVA  assuming  that  a  single  regression  line  applies  to 
all  groups.  Other  versions  permit  differences  in  parameter  values 
across groups. If removing the equality constraints yields
substantially improved predictive accuracy relative to the first
model, model parameters cannot be considered equal in the groups.
This result is equivalent to finding nonparallel regression lines in
ANCOVA.
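
A common way to compare the constrained and unconstrained versions of a multiple-group model is a chi-square difference test; using that test here is an assumption made for illustration, since the text speaks only of improved accuracy. The sketch below uses hypothetical fit statistics and a short table of standard upper 5% critical values of the chi-square distribution.

```python
# Upper 5% critical values of the chi-square distribution for 1 to 5 df.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_difference(chisq_constrained, df_constrained, chisq_free, df_free):
    """Difference in fit between a constrained model and a freer nested model."""
    delta_chisq = chisq_constrained - chisq_free
    delta_df = df_constrained - df_free
    if delta_df <= 0 or delta_chisq < 0:
        raise ValueError("the constrained model must be nested in the freer model")
    return delta_chisq, delta_df


def equality_rejected(chisq_constrained, df_constrained, chisq_free, df_free):
    """True if freeing the equality constraints improves fit at the .05 level."""
    delta_chisq, delta_df = chi_square_difference(
        chisq_constrained, df_constrained, chisq_free, df_free
    )
    if delta_df not in CRITICAL_05:
        raise ValueError("critical value not tabled for this many df")
    return delta_chisq > CRITICAL_05[delta_df]


# Hypothetical fit statistics: equal parameters across groups versus
# two parameters allowed to differ between groups.
print(equality_rejected(58.4, 26, 45.1, 24))
```

In this made-up example the difference is 13.3 with 2 degrees of freedom, which exceeds the critical value of 5.991, so the hypothesis of equal parameters across groups would be rejected. As noted elsewhere in this chapter, such tests become very sensitive in large samples, so the size of the parameter differences should be examined as well.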


The  SEM  analogue  of  moderated  regression  creates  cross-product 
terms  by  multiplying  scores  on  indicator  variables.  This  method  of 
testing for interactions is common in some areas of research (e.g.,
industrial-organizational psychology). A debate has been in progress
in the SEM literature since Kenny and Judd (1984) first introduced the
basic model. Subsequent work by Ping (1995, 1996) and Joreskog and
Yang (1996) simplified the implementation of the original method.
Topics under discussion today include how many interaction indicators
must be included in a model and how to differentiate curvilinearity
from interaction. Significance tests in this area are sensitive to
the underlying distributions of the variables. Schumacker and
Marcoulides (1998) summarized available methods and the ongoing debate
about how to best specify and estimate interactions.

Interactions  and  curvilinear  relationships  almost  certainly  will 
be continuing concerns in theory development and modeling. The
current  state  of  the  art  does  not  provide  definitive  direction 
regarding  the  best  method  of  addressing  these  issues.  Steiger  (2001) 
recommends reading Rigdon, Schumacker, and Wothke (1998) and Joreskog
(1998) (both in Schumacker & Marcoulides, 1998) as context for other recent
works  in  this  area.  Investigators  may  also  wish  to  consider  issues 
related  to  methodological  limitations  of  applied  research  that  affect 
the  results  from  moderated  regression  analyses  (McClelland,  1997; 
McClelland  &  Judd,  1993;  Russell  &  Bobko,  1992;  Russell,  Pinto,  & 
Bobko, 1991). These limitations may be even more problematic than the
purely  statistical  issues. 

At  present,  the  multiple-group  approach  is  the  preferred  option 
for modeling interactions in SEMs. Applications of this approach can
be  viewed  as  a  multivariate  extension  of  moderator  analysis.  The 
literature  on  that  procedure  may  provide  useful  qualitative  guidelines 
to  help  avoid  some  potential  problems  (e.g.,  Zedeck,  1971). 

Latent Growth Curve Analysis. Growth curves are a special area
of interest in SEM. The analysis of change is a longstanding problem
(Harris, 1963). However, patterns of change must be analyzed to
address many interesting and important research questions about
adaptation and development. For example, how do individual recruits
adapt to the psychological stress of boot camp? How do the attitudes
that affect reenlistment develop over time? Can early patterns of
change be used to predict success and failure in military service?
Such questions can be addressed today by modeling change as growth
curves.

The SEM approach to quantifying change was stimulated by Rogosa
and his colleagues (Rogosa, Brandt, & Zimowski, 1982; Rogosa &
Willett, 1985). Growth curve analysis can be applied when a variable
of  interest  is  measured  at  several  points  in  time.  Given  multiple 
measurements,  a  growth  curve  expressing  the  level  of  the  measured 
variable  as  a  function  of  time  can  be  fitted  to  the  data  for  each 
individual.  Fitting  the  function  yields  a  set  of  parameter  values  for 
each  individual.  If  growth  were  linear  over  time,  the  parameters 
would  be  the  slope  and  intercept  of  a  regression  equation  that  applied 
to  a  specific  individual.  Those  parameter  values  then  can  be  treated 
as  dependent  variables  in  analyses  that  relate  them  to  attributes  of 
the  individual,  group  membership,  and  other  potential  predictors.  The 
growth  curves  fitted  to  data  usually  are  fairly  simple  (e.g.,  linear 
growth),  but  many  different  curves  are  possible  in  principle  (Rogosa  & 
Willett, 1985).
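
The two-step logic described above (fit a curve to each individual,
then treat the fitted parameters as dependent variables) can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions: the recruit identifiers, weekly scores, and ages
are hypothetical, growth is assumed to be linear, and ordinary least
squares stands in for the SEM estimation a real analysis would use.

```python
from statistics import mean


def fit_line(x_values, y_values):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x_values), mean(y_values)
    sxx = sum((x - x_bar) ** 2 for x in x_values)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: a score measured weekly for four recruits,
# plus one background attribute (age) per recruit.
weeks = [0, 1, 2, 3]
scores = {
    "r1": [2.0, 3.0, 4.0, 5.0],
    "r2": [1.0, 1.5, 2.0, 2.5],
    "r3": [3.0, 3.0, 3.0, 3.0],
    "r4": [2.0, 4.0, 6.0, 8.0],
}
age = {"r1": 19, "r2": 18, "r3": 17, "r4": 21}

# Step 1: one (intercept, slope) pair per individual.
growth = {rid: fit_line(weeks, ys) for rid, ys in scores.items()}

# Step 2: treat the individual slopes as a dependent variable and
# relate them to an attribute of the individual.
ids = sorted(growth)
slope_on_age = fit_line([age[i] for i in ids], [growth[i][1] for i in ids])

for rid in ids:
    print(rid, growth[rid])
print("slope regressed on age (intercept, slope):", slope_on_age)
```

Fitting each person separately and then regressing the slopes ignores
the differing precision of the individual estimates. The latent growth
and multilevel methods discussed in this chapter handle that problem
within a single model.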

SEM  is  not  the  only  method  of  analyzing  latent  growth  curves. 
Raudenbush  and  Bryk  (2002;  Raudenbush,  2001)  have  extended  the  growth 
curve  approach  to  include  the  construction  of  multilevel  models  (cf. 
Hierarchical Linear Models [HLM]) discussed in the following section.

Hierarchical  Linear  Models 

Factors  affecting  behavior  often  have  a  hierarchical  structure. 
Consider,  for  example,  the  problem  of  modeling  morale  during  military 
basic  training.  Morale  could  be  measured  at  weekly  intervals  during 
training.  Other  variables  would  be  measured  to  understand  the  factors 
that affect morale. These variables could include recruit
characteristics  (e.g.,  age,  gender,  personality),  unit  membership 
(e.g.,  platoon,  division,  flight),  and  unit  characteristics  (e.g., 
number  of  recruits,  average  intelligence  test  scores,  average 
experience of instructors). Models describing morale can be
constructed  from  the  combination  of  trends  over  time  and  a  combination 
of  individual  and  group  characteristics. 

HLM  provides  a  general  method  of  addressing  research  problems 
such  as  that  described  in  the  previous  paragraph  (Raudenbush,  2001; 
Raudenbush & Bryk, 2002). The basic problem is hierarchical because
it  involves  several  distinct  levels  of  analysis.  The  first  level 
consists  of  within-person  processes.  In  the  morale  example,  the 
within-person  model  would  describe  changes  in  morale  over  time  for  a 
single  individual.  Suppose  morale  was  low  early  in  training  and  then 
increased  at  a  constant  rate  over  time.  This  temporal  pattern  could 
be  modeled  by  representing  morale  as  a  linear  function  of  time.  The 
model  for  each  person  would  consist  of  the  slope  and  intercept  for  the 
linear  model.  The  linear  model  would  be  the  latent  growth  curve  for 
morale  for  that  individual. 

The  second  level  would  introduce  individual  differences  as  a 
factor  in  morale  during  training.  Recruit  characteristics  could  be 
used  to  predict  differences  in  the  slopes  and  intercepts  estimated  in 
the  within-person  level  of  analysis.  The  modeling  process  could  be 
extended  to  a  third  level  that  characterized  unit  differences  in 
morale.  This  level  of  the  model  would  test  for  average  differences  in 
the  intercept  and  slope  of  the  within-person  model.  This  level  also 
would  relate  the  average  differences  to  the  unit  characteristics  that 
had been measured. The HLM approach, therefore, makes it possible to
analyze changes in morale as a combination of within-subject,
between-subject, and between-unit effects.
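
The three levels in the morale example can be written out as a set of
linked equations. The notation below is an assumption patterned on
common multilevel conventions, not taken from the chapter: t indexes
measurement occasions, i recruits, and j units; T is time in training,
X a recruit characteristic, and W a unit characteristic.

```latex
% Level 1 (within person): morale as a linear function of time
Y_{tij} = \pi_{0ij} + \pi_{1ij} T_{tij} + e_{tij}

% Level 2 (between persons): intercepts and slopes predicted
% from a recruit characteristic
\pi_{0ij} = \beta_{00j} + \beta_{01j} X_{ij} + r_{0ij}
\pi_{1ij} = \beta_{10j} + \beta_{11j} X_{ij} + r_{1ij}

% Level 3 (between units): unit-average intercepts and slopes
% predicted from a unit characteristic
\beta_{00j} = \gamma_{000} + \gamma_{001} W_{j} + u_{00j}
\beta_{10j} = \gamma_{100} + \gamma_{101} W_{j} + u_{10j}
```

In this sketch the Level 1 intercept and slope are the person's latent
growth curve, and each higher level treats the parameters of the level
below as outcomes.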

HLM  addresses  two  important  limitations  of  traditional  behavioral 
models.  Traditional  models  commonly  follow  disciplinary  boundaries 
that  treat  growth  processes,  individual  differences,  group  dynamics, 
and  social  environment  as  distinct  research  topics.  These 
distinctions  can  lead  to  incomplete  or  biased  models  when  the  behavior 
of  interest  is  affected  by  factors  at  more  than  one  level  of  analysis. 
In  such  cases,  the  predictive  accuracy  of  single-level  models  is 
limited  by  the  fact  that  some  systematic  sources  of  variance  are 
omitted.  The  model  is  incomplete  and  therefore  cannot  fully  account 
for  the  data.  Focusing  on  a  single  level  of  analysis  also  can  bias 
estimated  effects  for  the  current  level  of  analysis.  Bias  will  occur 
when  variables  in  the  single-level  model  are  confounded  with  causal 
factors  at  other  levels  of  analysis  (cf.,  James  et  al.,  1982).  HLM 
provides  tools  to  produce  models  with  greater  explanatory  power  by 
combining  the  several  levels  of  causal  factors.  The  same  tools 
produce  more  accurate  estimates  of  the  effects  of  factors  at  each 
level  controlling  for  the  effects  of  factors  at  other  levels. 

The advantages of HLM can be critical to a proper understanding
of  behavioral  phenomena.  For  example,  Duncan,  Jones,  and  Moon  (1993) 
analyzed  the  health  behavior  of  people  living  in  different  regions  of 
the  United  Kingdom.  Previous  research  suggested  that  health  behavior 
differed  between  geographical  regions.  Those  differences  could  be 
interpreted  as  evidence  that  regional  cultural  differences  affected 
behavior. Duncan et al. (1993) used HLM to show that the reported
geographical differences were almost completely explained by
demographic  differences  between  regions.  In  a  very  different  context, 
Sliwinski  and  Hall  (1998)  applied  HLM  to  assess  claims  that  aging 
exerts  a  general  negative  effect  on  all  mental  capacities.  Sliwinski 
and  Hall's  (1998)  hierarchical  model  grouped  the  mental  tests  into 
categories.  When  those  categories  were  included  in  the  model,  age 
effects  were  limited  to  just  a  subset  of  the  mental  capacities.  These 
examples  (and  others,  see  Raudenbush  &  Bryk,  2002)  illustrate  that  HLM 
can  be  applied  to  problems  ranging  from  processes  occurring  within 
individuals  to  processes  that  characterize  broad  socioeconomic 
groupings.

HLM  could  be  extremely  valuable  in  military  behavioral  research. 
Previous  applications  of  this  methodology  in  educational  research 
(cf.,  Raudenbush  &  Bryk,  2002)  have  obvious  parallels  to  military 
research  on  education,  but  the  potential  extends  well  beyond  this  area 
of  study.  The  key  elements  of  the  method  are  the  availability  of  a 
series  of  measurements  on  individuals  combined  with  a  hierarchical 
structure  of  some  type.  The  approach  could  be  applied  to  topics  such 
as  evaluating  different  weight  loss  programs  by  determining  growth 
curve  parameters  for  participants.  The  relationship  between  average 
growth  curve  parameters  and  quantitative  or  qualitative 
characteristics  of  the  programs  could  be  analyzed  to  identify  the 
critical  ingredients  of  effective  programs.  These  analyses  could 
include  adjustments  for  the  effects  of  individual  differences  on 
growth  curve  parameters.  The  adjustment  would  make  it  possible  to 
estimate  program  effects  while  controlling  for  differences  in 
personnel  composition. 

The  use  of  HLM  could  stimulate  theory  development  by  integrating 
models  from  traditionally  distinct  research  disciplines.  Theoretical 
statements from individual differences psychology, social
psychology,  sociology,  and  organizational  psychology  can  be  combined. 
This  end  is  achieved  by  developing  models  that  treat  the  individual, 
the  small  group,  and  social  categories  as  different  levels  in  a 
hierarchy.  Parameters  for  the  various  perspectives  can  be  estimated 
in  a  single  analysis.  The  estimation  procedures  adjust  for 
differences  at  other  levels  within  the  model,  so  allowance  is  made  for 
group  composition  when  analyzing  unit  effects  and  vice  versa 
(Raudenbush  &  Bryk,  2002). 

The  merging  of  theoretical  perspectives  could  increase  the 
importance  of  models  of  group  dynamics  in  some  key  areas.  For 
example,  delinquent  behavior  (e.g.,  attrition,  nonjudicial  punishment) 
is  more  prevalent  in  some  military  units  than  others.  If  the 
differences remain after controlling for unit composition, attention
to  explanations  based  on  unit  factors  is  a  reasonable  step.  Military 
tradition  makes  it  likely  that  leadership  would  become  one  focus  of 
inquiry.  However,  leadership  may  not  be  the  source  of  the  unit 
differences (Vickers, Hervig, Wallick, & Conway, 1984). If so,
alternative models must be considered. Behavioral contagion
(transmission of negative attitudes from one person to another) is an
example of a possible alternative (Jones, 1998). Properly applied,
HLM can test individual difference, unit leadership, and behavioral
contagion  models  at  the  same  time.  This  joint  representation  of 
multiple  levels  of  analysis  in  a  single  model  may  be  needed  to 
represent  the  complexity  of  the  influences  on  real  behavior. 

Using  HLM  to  avoid  incomplete  and  biased  models  will  introduce 
new  research  design  issues.  In  general,  the  organizational  units 
studied  are  only  a  sample  from  a  population  of  similar  units.  This 
observation  raises  sampling  issues.  How  many  units  should  be  sampled 
to  accurately  estimate  effects?  A  large  sample  of  individuals  no 
longer  guarantees  adequate  sampling.  The  sampling  frame  for  a  study 
must  include  units  as  entities  to  be  sampled.  If  it  is  convenient,  all 
of  the  unit  personnel  may  be  included  in  a  study.  However,  covering  an 
adequate  sample  of  units  could  require  collecting  data  from  very  large 
numbers of individuals. If the cost of collecting data from an
individual  is  high,  it  may  be  necessary  to  use  stratified  sampling 
within  units. 

One aspect of HLM should be noted to avoid confusion when
considering  this  method  in  conjunction  with  other  methods  discussed 
here.  In  the  context  of  HLM,  a  latent  variable  is  any  unmeasured 
variable  (Raudenbush  &  Bryk,  2002).  Thus,  references  to  latent 
variable  modeling  are  not  equivalent  to  references  to  latent  trait 
modeling  in  SEM.  Instead,  latent  variable  discussions  in  HLM  are  more 
likely  to  address  problems  such  as  the  imputation  of  missing  data.  In 
this  case,  the  missing  data  comprise  a  latent  variable  in  the  sense 
that  each  case  presumably  has  a  specific  value,  but  that  value  is  not 
measured  in  the  study.  These  superficially  different  uses  of  the  term 
"latent  variable"  are  not  contradictory.  Instead,  the  different 
applications  of  the  term  represent  different  instances  of  unmeasured 
variables.  The  development  of  an  integrated  perspective  on  latent 
variables  is  an  ongoing  topic  of  discussion  in  data  analysis  generally 
(Bollen, 2002).

Categorical  and  Limited  Dependent  Variable  (CLDV)  Models 

A  wide  variety  of  models  can  be  grouped  together  under  the 
heading  of  CLDV  models.  These  models  deal  with  dependent  variables 
that  are  categorical  (e.g.,  pass-fail)  or  time-limited  in  some  way 
(e.g.,  the  observation  period  is  stopped  or  participants  die,  move 
away, or refuse to continue participation). In each case, the nature
of the dependent variable creates difficulties either because it is
not linearly related to predictors or because error variance is
heteroscedastic, or both.

Military researchers often encounter CLDV. Examples of
dichotomous categories include pass-fail measures of success in
training programs, attrition, and reenlistment. More complex
categorical criteria are represented by specific reasons for attrition
(e.g., medical, behavioral, administrative) or rank/rate at the end of
a period of enlistment. Time-limited variables can occur when an
individual is transferred from one unit to another or returned to
civilian status prior to completing an enlistment. Accurate modeling
of CLDV obviously is important.


Recent  developments  have  produced  an  integrated  approach  to  CLDV 
models. The approach has two fundamental elements (McCullagh &
Nelder, 1989). Each model has a transformation component known as a
link  function.  The  link  function  generates  transformed  variables  that 
have  linear  relationships  to  predictor  variables.  The  choice  of  a 
link  function  depends  on  the  nature  of  the  dependent  variable.  The 
critical  point  is  that  the  transformed  variables  can  be  analyzed  using 
familiar  linear  regression  methods. 


The  second  element  of  CLDV  models  is  known  as  the  systematic 
component.  This  component  is  a  linear  model  quantifying  the 
relationships  between  predictors  and  the  transformed  dependent 
variable  generated  by  the  link  function.  CLDV  models  can  be  fitted 
using the generalized linear model (GLM) approach to obtain appropriate
statistics.
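
The two elements can be illustrated with a short Python sketch. It
assumes a dichotomous outcome, shows the logit and probit links as two
possible link functions, and uses hypothetical coefficients. The
function names are illustrative and not part of any particular
package.

```python
import math
from statistics import NormalDist


def logit(p):
    """Logit link: maps a probability onto the unbounded linear scale."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse logit: maps the linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit(p):
    """Probit link: the standard normal quantile of a probability."""
    return NormalDist().inv_cdf(p)


def inv_probit(eta):
    """Inverse probit: the standard normal cumulative probability."""
    return NormalDist().cdf(eta)


def linear_predictor(intercept, slope, x):
    """Systematic component: a linear model on the transformed scale."""
    return intercept + slope * x


# Hypothetical coefficients for a single predictor score of 40.
eta = linear_predictor(-4.0, 0.1, 40.0)
p_logit = inv_logit(eta)
p_probit = inv_probit(eta)
print(eta, p_logit, p_probit)
```

The linear model operates on the transformed scale. Predictions about
the original outcome require the inverse of the link function.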


The link function approach to CLDV has several positive
characteristics (Long, 1997). The most important is that it is no
longer necessary to settle for linear approximations to more complex
relationships. Complex mathematical functions can be approximated by
linear functions over narrow ranges of scores. In the context of
CLDV, this fact means that a traditional linear model based on raw
data can have acceptable predictive power even though it does not
describe the true functional relationship between the predictors and
the dependent variable. The apparent predictive power can be
misleading. For example, basic assumptions such as homoscedasticity
of error variance can be violated. More importantly, the resulting
equation may give the impression that outcomes (e.g., risk of
attrition) increase equally with each point in the scale score. In
fact, the change may be much greater in some areas of the score range
than in others. The appropriate mathematical expression must be
employed to accurately assess effects and accurately predict
outcomes.

Capitalizing  on  the  advances  in  CLDV  analysis  also  provides  other 
payoffs.  Link  functions  make  it  possible  to  apply  familiar  methods 
from  regression  analysis  to  build  more  robust  models.  For  example, 
outlier  and  influential  data  points  can  be  identified  in  CLDV  analyses 
using  the  same  procedures  employed  in  linear  regression.  As  another 
example,  polynomial  regression  can  be  used  to  express  the  transformed 
variable  as  a  nonlinear  function  of  a  predictor.  The  link  function 
approach  also  makes  it  possible  to  apply  maximum  likelihood  methods 
for estimating parameter values. Thus, familiarity with this general
approach  to  estimation  helps  construct  CLDV  models. 

CLDV  models  cannot  be  interpreted  directly.  Link  functions 
involve  nonlinear  transformations.  The  transformation  must  be 
reversed  to  predict  the  original  variable.  The  reverse  transformation 
must  be  considered  because  a  change  of  one  unit  in  a  predictor  can 
have  very  different  effects  in  different  score  ranges.  For  example, 
in  a  logistic  regression  model,  the  effect  of  a  one-point  difference 
in  scores  is  not  the  same  over  the  full  range  of  scores.  At  the  high 
and  low  ends  of  the  range,  a  one-point  difference  typically  will  have 
little  effect  on  the  predicted  outcome.  The  same  difference  can  have  a 
pronounced  effect  in  the  middle  range  of  the  score  distribution.  This 
point  should  not  be  a  barrier  to  the  use  of  CLDV  methods.  The 
benefits  of  the  approach  make  it  worthwhile  to  take  the  time  to 
understand  its  procedures. 
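
The point about unequal effects across the score range can be checked
numerically. The sketch below assumes a logistic regression with
hypothetical coefficients chosen so that a score of 50 corresponds to
a predicted probability of .50. It compares the change in predicted
probability produced by a one-point difference at low, middle, and
high scores.

```python
import math


def predicted_probability(score, intercept=-5.0, slope=0.1):
    """Reverse the logit transformation to recover a probability."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))


def one_point_effect(score, **coefficients):
    """Change in predicted probability for a one-point higher score."""
    return (predicted_probability(score + 1, **coefficients)
            - predicted_probability(score, **coefficients))


# The coefficient is constant on the logit scale, but the effect on
# the predicted probability depends on where the score falls.
for score in (10, 50, 90):
    print(score, round(one_point_effect(score), 4))
```

With these assumed coefficients, the one-point effect in the middle of
the range is more than ten times the effect at either extreme, even
though the model has a single slope on the transformed scale.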

One  ongoing  line  of  development  in  CLDV  analysis  is  of  special 
interest  for  model  construction.  The  use  of  specific  contrasts  within 
a  GLM  approach  to  CLDV  has  been  explored  as  a  method  of  formulating 
and testing alternative causal paths (von Eye & Brandstadter, 1998).
These  causal  paths  include  a  "wedge"  by  which  two  causes  can  cause  a 
single  outcome,  a  "fork"  by  which  a  single  cause  can  lead  to  different 
outcomes,  and  a  "chain"  by  which  a  cause  and  an  outcome  are  linked  by 
an  intermediate  variable.  This  approach  holds  promise  of  making  it 
possible  to  use  procedures  such  as  loglinear  modeling  to  test  very 
specific  causal  hypotheses  expressed  entirely  in  terms  of  categorical 
variables.  The  basic  method  draws  on  the  construction  of  contrasts 
that  are  similar  to  those  found  in  ANOVA,  but  the  structure  of  the 
contrasts  is  linked  to  specific  hypotheses  about  sources  of  causal 
effects.

Path  Models  and  MAGIC 

The  methods  described  above  can  increase  the  quality  of  evidence 
available to the principled argument process. The relationships between
these  methods  and  MAGIC  criteria  are  too  complex  to  describe  in 
detail.  However,  some  examples  of  potential  gains  can  be  provided. 

Good  explanatory  power  (magnitude)  is  a  fundamental  goal  of 
modeling. This goal can be promoted in several ways.

•  The  HLM  discussion  illustrated  the  potential  for  combining 
sources  of  variance  that  might  be  explored  independently  in 
traditional  models.  The  combination  should  increase  overall 
explanatory  power.  The  combination  strategy  also  reduces  the 
risk  of  biased  parameter  estimates. 

•  SEM  can  provide  greater  insight  into  the  true  strength  of 
associations.  This  gain  is  expected  because  SEM  estimates  are 
corrected  for  measurement  error.  The  modeling  implication  is 
that  a  given  model  will  not  be  preferred  to  an  alternative 
simply because the first is represented by more reliable
measurements.

• Advances in CLDV analysis methods make it possible to replace
linear approximations involving raw variables with appropriate
functional relationships. For example, probit or logit
analyses can be used in place of linear regression. These
analyses have the potential to increase the magnitude
of associations.

Articulation will improve. The task of accurately translating
theoretical statements into mathematical equations is difficult
(Blalock, 1969). The difficulty increases if the analysis cannot
incorporate key elements of the theory. Key elements may include
curvilinear relationships, equality or proportionality
constraints on specific parameters, and interactions. The methods
described here make it easier to build these details into models.

Generality is an empirical question, as noted earlier. The basic
issue is whether a single model is appropriate for all entities (e.g.,
people or groups) or if specific models are needed for different
subsets of entities (e.g., men and women). The methods described here
can be applied to test for generality. SEM methods for such tests
have been alluded to in connection with measurement models. These
methods offer a great deal of flexibility in specifying alternative
models for different subgroups. Because HLM methods include the [...]

Interest value may also be improved by the methods described
here. The process of articulating the rationale for a model is likely
to stimulate thinking about alternative models. Comparing plausible
alternatives is more interesting than simply evaluating a model in
isolation. Given similar models, the use of methods that make it
possible to isolate key differences between models will increase
interest. The methods described in this section can promote
analyses that focus on critical parameters that differ between models.
The effects of modifying those specific parameters may be modest in
terms of the absolute fit of the model, but still could be important
in choosing between alternative models.

The methods described here can increase credibility
by direct comparison of models based on different theoretical
positions and by stronger empirical tests of claims for the generality
of different models. Because this process reduces the likelihood of
confirmation bias, credibility should increase even if the evidence
does not clearly support one model over competing alternatives.


Model Evaluation


Once constructed, a model must be evaluated for its adequacy [...]
the main business of research (Kirk, 1996). Statistical models [...]

Model Appraisal

[...] It is important to properly interpret these
statistical concepts. The null hypothesis significance test
(NHST) and common ES criteria are examined to illustrate the state of
the art in traditional model appraisal practices. SEM appraisal
methods are examined to illustrate an alternative approach that is
likely to have an increasing influence in the future.

Significance  Tests 

Significance tests are the most common tool for model appraisal
in research. Significance tests can be diffuse (omnibus) or
focused. Diffuse significance tests involve multiple degrees of
freedom (dfs); focused tests involve a single degree of freedom
(Rosenthal & Rosnow, 1984). For example, tests for main effects and
interactions in ANOVA often involve df > 1. Planned or post hoc
contrasts decompose the omnibus test into a set of component (df = 1)
tests.

Significance test procedures can involve an NHST or a strong
significance test (SST). NHST procedures compare the sample estimate
of a parameter value to a hypothetical value of zero. SST computations
compare sample estimates to values derived from theory or prior
research. SST will include zero values when appropriate, but such
cases are expected to be rare (Meehl, 1978; Schmidt, 1996). Thus, the
hypothesized parameter value(s) in SST ordinarily will be nonzero.


The focused-diffuse distinction is less important for the present
discussion than the NHST-SST distinction. The major issue associated
with the difference between focused and diffuse tests involves the
number of significance tests performed. Diffuse tests apply a single
test to evaluate more than one parameter. Focused testing requires
multiple significance tests (i.e., one for each degree of freedom).
The process of performing multiple significance tests increases the
likelihood that at least one result will be significant by chance.
Special procedures to deal with this problem can be used to control
the inflated risk of mistaking a chance result for a true effect
(Keselman, Cribbie, & Holland, 1999; Seaman, Levin, & Serlin, 1991).
Various methods for planned and post hoc comparison in ANOVA are
examples of how to deal with this problem (Winer, Brown, & Michels,
1991). The following discussion assumes that such procedures will be
used to control for multiple significance testing.
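
The size of the multiple-testing problem is easy to quantify in a
simplified case. The sketch below assumes independent tests that are
each conducted at the same alpha level. The Bonferroni and Holm
adjustments are shown only as two familiar examples of control
procedures, not as the specific methods recommended in the sources
cited above.

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one false rejection in k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-test criterion that holds the familywise rate near alpha."""
    return alpha / k


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns decisions in the input order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for rank, index in enumerate(order):
        # The criterion relaxes as smaller p values are rejected.
        if p_values[index] <= alpha / (len(p_values) - rank):
            reject[index] = True
        else:
            break  # Stop at the first nonsignificant result.
    return reject


for k in (1, 5, 10, 20):
    print(k, round(familywise_error_rate(0.05, k), 3))
print(holm_reject([0.01, 0.04, 0.03]))
```

Under these assumptions, ten focused tests at the .05 level carry
roughly a 40% chance of at least one significant result when every
null hypothesis is true.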

NHST and SST have very different implications for modeling.
Roughly speaking, NHST evaluates whether an attempt to construct a
model is worthwhile. SST determines whether an existing model (i.e.,
set of parameters) should be retained. NHST and SST procedures are
complementary when viewed as elements of an overall model development
process. Initially, NHST is employed with the null hypothesis as a
[...] (Krantz, 1999). If the null hypothesis cannot be
rejected, zero is a reasonable estimate for the parameter(s) being
evaluated. A model in which all parameters are zero will be of
substantive interest only insofar as it rules out some possible
relationships. Thus, NHST is most likely to be performed in
research intended to develop estimates of parameters that
are believed to have nonzero values. As further research is
conducted and evidence accumulates, refined estimates of model
parameters can be developed based on the cumulative research evidence.
SST can be applied when the model is represented by a set of nonzero
parameter estimate(s) based on prior work or theory. In this
situation, the question is whether the sample estimates of the
parameters are within sampling error of the predicted values. Using
the usual significance criterion, the desirable NHST outcome
is p < .05. This outcome justifies tentatively adding the parameter(s)
to the behavioral model. The desirable SST outcome is
p > .05. This outcome would justify retaining the existing model.
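
The contrast between the two procedures can be made concrete with a
single parameter estimate. The sketch below is a simplified
illustration. The estimate, its standard error, and the theory-based
value of .25 are hypothetical, and a large-sample normal approximation
is assumed for the test statistic.

```python
from statistics import NormalDist


def two_sided_p(estimate, hypothesized, std_error):
    """Two-sided p value for H0: parameter equals the hypothesized value."""
    z = (estimate - hypothesized) / std_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical sample result for one model parameter.
estimate, std_error = 0.30, 0.10

# NHST: compare the estimate to zero.
p_nhst = two_sided_p(estimate, 0.0, std_error)

# SST: compare the estimate to a value predicted by theory or by
# prior research (an assumed value of .25 here).
p_sst = two_sided_p(estimate, 0.25, std_error)

print("NHST p:", round(p_nhst, 4))
print("SST p:", round(p_sst, 4))
```

In this example the NHST result (p < .05) would support adding the
parameter to the model, and the SST result (p > .05) would support
retaining the theory-based value.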

NHST Procedures. NHST is the most common metric for model
appraisal (Finch, Cumming, & Thomason, 2001; Kirk, 1996; Vacha-Haase &
Ness, 1999). Meehl (1978) has argued that reliance on NHST is one
reason for the stunted growth of behavioral models. His viewpoint is a
widely quoted anchor point in an ongoing debate. Arguments in the
debate range from recommending that NHST be banned entirely to arguing
that NHST would have to be invented if it did not already exist
(Abelson, 1997). The American Psychologist recently published a
negative view (Cohen, 1994), followed by a rebuttal (Hagen, 1997) and
an attempt at synthesis (Krueger, 2001). The scope of the debate is
broadened by examining topics such as the actual use of NHST in
practice (Nelson, Rosenthal, & Rosnow, 1986) and the historical [...]
([...]; Smith, Best, Cylke, & Stubbs, 2000). The full range of topics
considered in the debate can be found in Harlow, Mulaik, and Steiger
(1997). A comparison between this and an earlier collection by
Morrison and Henkel (1970) [...] the rate at which the debate
progressed. [...] summary of the current status of the debate.

NHST can be a trap for the unwary. Cohen (1994, p. 997)
highlighted this problem when he wrote: "What's wrong with NHST?
Well, among many other things, it does not tell us what we want to
know ..." (italics added). Researchers collect data for the purpose of
evaluating the status of a model. NHST results can lead to erroneous
inferences about the status of a model for any of the following
reasons:

• The p value is not the probability that the model is
correct. Instead, p is the probability of the data if the null
hypothesis is correct. The critical point here is that the p
value must be combined with other information to determine how
the data affect the probability of the model. Cortina and
Dunlap (1997), [...] and O'Reilly (1999), Krueger (2001), and
[...] discuss Bayes's theorem as the appropriate
basis for using p as one element in estimating the


31 


Statistics  and  Model  Construction 


Fleming (2000) compare the Bayesian and NHST approaches. For
the present purposes, it is sufficient to note that the
relationship is not straightforward. For example, the null
hypothesis can be rejected when the data actually increase the
probability that this hypothesis is true (Lindley, 1957).

• The complement of the NHST p value (i.e., 1 - p) derived from
a single study is not the likelihood that the alternative
model is correct. The complement also is not the likelihood that
the results will replicate. Both interpretations are wrong,
although NHST p values can be a rough guide to the likelihood
of replication (Greenwald, Gonzalez, Harris, & Guthrie, 1996).

• Rejecting the null hypothesis in each of several studies does
not mean their results were replicated. If the sign of the
statistic used in the test was the same in each study, the
results replicate qualitatively. This qualitative criterion is
accepted as evidence of replication under NHST (Greenwald et
al., 1996). However, a quantitative replication criterion
could produce a different conclusion. For example, suppose
three studies were conducted with N = 200 in each study.
Suppose the correlations in the studies were r = .15, r = .50,
and r = .90. The null hypothesis would be rejected in each
study. However, most researchers would be reluctant to treat
the results as equivalent because every pairwise difference
would be statistically significant.


• NHST does not indicate whether a particular parameter is large
enough to be important in practical or theoretical terms. The
conceptual definition "Significance = Effect Size × Sample
Size" (Rosenthal & Rosnow, 1984) shows why. Even trivial
deviations from zero will be statistically significant given a
large enough sample. Conversely, effects that are large enough
to have practical and/or theoretical value will be
statistically nonsignificant if the sample is small enough.
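
The replication and sample size points in the list above can be
checked numerically. The following is a minimal sketch, not the
authors' procedure: it assumes the Fisher z normal approximation for
testing correlations, and the trivial correlation of .02 and the
function names are illustrative choices only.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_value_for_r(r, n):
    """Two-sided p for H0: rho = 0 (Fisher z normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def p_value_for_difference(r1, n1, r2, n2):
    """Two-sided p for H0: rho1 = rho2 in two independent samples."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


# The three studies described in the text: N = 200 in each.
correlations = [0.15, 0.50, 0.90]
n_per_study = 200

single_study_p = [p_value_for_r(r, n_per_study) for r in correlations]
pairwise_p = [
    p_value_for_difference(correlations[i], n_per_study,
                           correlations[j], n_per_study)
    for i in range(3) for j in range(i + 1, 3)
]

# A trivial correlation (hypothetical r = .02) at increasing sample sizes.
trivial_r = 0.02
p_by_sample_size = {n: p_value_for_r(trivial_r, n)
                    for n in (100, 1_000, 100_000)}

print("p for each study:", [round(p, 4) for p in single_study_p])
print("p for each pairwise difference:", [round(p, 4) for p in pairwise_p])
print("p for r = .02 by N:",
      {n: round(p, 4) for n, p in p_by_sample_size.items()})
```

Under these assumptions, every study rejects the null hypothesis and
every pairwise difference is also significant, while the same trivial
correlation moves from nonsignificant to significant as N grows.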

These interpretive pitfalls can be avoided by careful use of NHST.
Harlow (1997) provides a succinct summary of options that are
available to minimize the risk of misinterpretation. However, it is
not easy to maintain perfection in this regard. Cohen (1994) lists an
impressive array of established statistical experts who have erred at
one time or another.
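
The Lindley (1957) point noted in the list above, that a null
hypothesis can be rejected even though the data raise its probability,
can be illustrated with a small calculation. This is a sketch under
stated assumptions rather than a general result: a normal mean with
known standard deviation, a point null hypothesis, a normal prior on
the mean under the alternative, and equal prior odds. The values of
`tau`, `sigma`, and `n` are hypothetical.

```python
import math
from statistics import NormalDist


def lindley_demo(z, n, tau=1.0, sigma=1.0, prior_h0=0.5):
    """Return (two-sided p, posterior probability of H0).

    H0: mu = 0.  H1: mu ~ Normal(0, tau**2).  The data are summarized by
    the z statistic of a sample mean based on n observations with known
    standard deviation sigma.
    """
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    ratio = n * tau ** 2 / sigma ** 2
    # Bayes factor in favor of H0: ratio of the two marginal densities
    # of the sample mean.
    bf01 = math.sqrt(1 + ratio) * math.exp(-0.5 * z ** 2 * ratio / (1 + ratio))
    posterior_odds = bf01 * prior_h0 / (1 - prior_h0)
    return p_value, posterior_odds / (1 + posterior_odds)


p_value, posterior_h0 = lindley_demo(z=2.0, n=10_000)
print(f"p = {p_value:.4f}, P(H0 | data) = {posterior_h0:.3f}")
```

With a very large sample, the result is significant at the .05 level,
yet the posterior probability of the null hypothesis is higher than
its prior probability of .50.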


The list of things that NHST does not tell us is impressive, so
why take the risk? The answer lies in the fact that NHST really is
necessary in some instances. NHST is appropriate for evaluating
whether findings are due to chance (Mulaik, Raju, & Harshman, 1997).
NHST also is informative in answering some specific questions that
involve dichotomous alternatives (Abelson, 1997; Greenwald et al.,
1996; Hagen, 1997; Mulaik et al., 1997; Wainer, 1999). These
applications of NHST support the argument that this procedure is a
necessary if sometimes misleading tool for model evaluation (Abelson,
1997).

The recommended strategy for minimizing the negative effects of
NHST is to report results more completely ([...], 1997). A confidence
interval (CI) is the most commonly recommended alternative to NHST for
this purpose. This interval provides a point estimate of ES and
indicates the precision of the estimate (Cumming & Finch, 2001;
Wilkinson & the Task Force on Statistical Inference, 1999). CIs are
directly linked to the familiar NHST procedures and support the
development of cumulative parameter estimates as a research domain
matures (Cumming & Finch, 2001). Methods of computing confidence
intervals are available for all common indicators (Algina & Moulder,
2001; Cumming & Finch, 2001; Fan & Thompson, 2001; Fidler & Thompson,
2001; Mendoza & Stafford, 2001; Smithson, 2001). At a minimum,
investigators should report the exact value of the test statistic and
probability level along with sample size (e.g., t = 2.88, 32 df). This
information generally is sufficient to permit computations of ES and
CI. The ES component of the CI leads the discussion directly to the
second criterion for evaluating models.
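
As an illustration of how a reported test statistic can be converted
to an ES and a CI, the sketch below uses the example values t = 2.88
with 32 df. It assumes that the t statistic comes from a two-group
comparison or a simple correlation (so that N = df + 2) and uses the
Fisher z approximation for the interval. Other designs would need
different formulas.

```python
import math
from statistics import NormalDist


def r_from_t(t, df):
    """Convert a t statistic to a correlation-type effect size."""
    return t / math.sqrt(t ** 2 + df)


def ci_for_r(r, n, level=0.95):
    """Approximate CI for a correlation via the Fisher z transformation."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


t_value, df = 2.88, 32
n_total = df + 2  # assumption: two-group t test or simple correlation

r = r_from_t(t_value, df)
ci_low, ci_high = ci_for_r(r, n_total)
print(f"r = {r:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The width of the resulting interval conveys the precision of the
estimate, which the significance test alone does not.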

SST Procedures. SST avoids some NHST problems by replacing the
NHST assumption that ES = 0 with ES = k, where k is a parameter value
that differs from zero. While k could be based on theory, behavioral
theories seldom are sufficiently developed to permit this. Parameter
values are more likely to be derived from prior research. SST,
therefore, can be viewed as a consistency test: Are the current data
consistent with the evidence from prior studies? If p > .05, this
question can be answered affirmatively. If the sample were large, the
range of values that would yield an affirmative answer would be small.
If the model is not correct, observed values that were close enough to
the predicted values to fall in the range of acceptable values would
be "a darned strange coincidence" (Salmon, 1984). As a result, the SST
would be a risky test of consistency between the present data and
either prior research or theory because the data could easily be
inconsistent with the model prediction (Meehl, 1990a). A coincidence
that is consistent with a risky prediction provides strong support for
the model being tested.
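
A consistency test of this kind can be sketched for a correlation.
The code below is an illustration only: it assumes the Fisher z
approximation, and the prior value k = .30, the observed r = .25, and
the sample sizes are hypothetical numbers chosen for the example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sst_p_value(r_observed, n, k):
    """Two-sided p for H0: rho = k, where k comes from prior research."""
    z = (math.atanh(r_observed) - math.atanh(k)) * math.sqrt(n - 3)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def consistent_range(k, n, alpha=0.05):
    """Observed correlations that would not reject H0: rho = k."""
    half_width = STD_NORMAL.inv_cdf(1 - alpha / 2) / math.sqrt(n - 3)
    center = math.atanh(k)
    return math.tanh(center - half_width), math.tanh(center + half_width)


k = 0.30  # hypothetical parameter value taken from earlier studies
p = sst_p_value(r_observed=0.25, n=200, k=k)
small_sample_range = consistent_range(k, n=50)
large_sample_range = consistent_range(k, n=5_000)

print(f"p = {p:.3f}")
print("consistent range, N = 50:  ", [round(v, 3) for v in small_sample_range])
print("consistent range, N = 5000:", [round(v, 3) for v in large_sample_range])
```

The range of observed values that count as consistent with k narrows
sharply as N increases, which is what makes the test risky for the
model in large samples.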

SST and NHST are formally similar. Both tests estimate the
probability that the study results would have been obtained under a
specific hypothesis. NHST asserts that the parameters in the model are
zero; SST specifies non-zero values. This difference is the
reason that NHST and SST are complementary in the context of overall
research programs. SST cannot be used without knowledge of the
parameter values, so this procedure is not feasible in the initial
stages of the study of behavioral phenomena. SST can be used once
prior research provides non-zero values for the parameter estimates.
At this point, NHST would be counterproductive because it ignores
prior findings. Thus, replacing NHST with SST implies movement along
the continuum from exploratory to confirmatory research. Movement
toward SST is desirable because it implies stronger theory based on
cumulative empirical evidence. Movement toward SST should facilitate
the development of reliable knowledge. Meta-analysis provides methods
of accumulating results across studies (Glass, 1976; Glass, McGaw, &
Smith, 1981). This analytic methodology is widely used at present, but
meta-analytic results do not appear to be used to generate SST with
any frequency.

In the final analysis, neither NHST nor SST is an entirely
satisfactory method for model evaluation. Neither procedure addresses
the fundamental question of whether the model is sufficiently accurate
to satisfy Serlin and Lapsley's (1985) "good enough" principle. NHST
is unsatisfactory because the null hypothesis can be rejected when a
model has virtually no explanatory power provided the sample is large
enough. Similarly, the existing model associated with SST can be
accepted even though it meets the criteria for a risky test. This can
happen even with a large sample if the model parameters are known with
some accuracy but represent only a subset of the parameters required
for a complete model. The accuracy of the model is a distinct issue
that can only be addressed by considering an additional criterion:
explanatory power.


Explanatory  Power 

Explanatory power is how well the model accounts for variation in
the phenomena of interest. This model attribute often is evaluated in
terms of proportional reduction in error (PRE). PRE reflects the
proportional reduction in cumulative error achieved by substituting
the predictions from a fitted model for the predictions from the null
model. Common PRE indices are r2 for correlation, R2 for regression,
and [...]. Draper and Smith (1998) and Cohen, Cohen, West, and
Aiken (2003) provide excellent introductions to explanatory power in
relation to applied regression procedures. Their sections on model
fit and related topics should apply to various types of GLM models.
For example, computer programs often print out ANOVA tables for
regression models and estimates of R2. PRE measures also are available
for categorical dependent variables (Hildebrand, Laing, &
Rosenthal, 1977).
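
The PRE idea can be made concrete with a small calculation. The
sketch below assumes squared error as the error measure, the sample
mean as the null model, and a one-predictor least squares line as the
fitted model. The data values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def proportional_reduction_in_error(y, predictions):
    """PRE = 1 - (error of fitted model) / (error of null model)."""
    mean_y = sum(y) / len(y)
    error_null = sum((yi - mean_y) ** 2 for yi in y)
    error_model = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    return 1 - error_model / error_null


def pearson_r(x, y):
    """Pearson correlation, computed directly from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 2.9, 4.8, 4.1, 6.5, 6.0]

intercept, slope = fit_line(x, y)
predictions = [intercept + slope * xi for xi in x]
pre = proportional_reduction_in_error(y, predictions)
r_squared = pearson_r(x, y) ** 2
print(f"PRE = {pre:.4f}, r squared = {r_squared:.4f}")
```

For a one-predictor least squares model, the PRE computed this way
equals the squared correlation, which is why r2 and R2 can be read as
PRE indices.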


Explanatory power is linked to ES. The linkage makes it possible
to express explanatory power in terms of either strength of
association (e.g., R2, s2) or magnitude of ES (e.g., r, Cohen's d).
Both association and magnitude indices are readily available for
common analysis procedures (e.g., regression, ANOVA; cf. Cohen, 1988;
Hedges & Olkin, 1985; Kirk, 1996). When reporting ES or PRE, several
points should be kept in mind:


• Dichotomous decision rules are counterproductive. The limitations
of this approach are evident from the history of NHST. NHST
procedures were developed in the context of the need to choose
between alternative courses of action (Cowles & Davis, 1982).
Significance standards were rule-of-thumb criteria established by
well informed individuals who recognized a need to make a yes-no
decision in the presence of uncertainty. The extensive
literature on NHST demonstrates the problems that arose when this
procedure subsequently was codified and ritualized (e.g., Meehl,
1978, 1990b). Flexible reasoning will be more productive than
rigid application of a dichotomous decision-making scheme. Thus,
Cohen's (1988) ES guidelines should be applied in the spirit in
which they were offered. Transforming these guidelines into rigid
rules for dichotomous decisions would be a serious mistake.

• Small ESs can be important. In applied research, small effects
can be important when they involve repetitive events that yield
large cumulative trends (Abelson, 1985) or when the outcome being
predicted is very important (e.g., heart attack mortality; Rosnow
& Rosenthal, 1989). In theoretical studies, small ESs can be
important when there is a small difference between stimuli that
produce an effect and/or when the dependent variable is difficult
to influence (Prentice & Miller, 1992).

• Capitalization on chance inflates sample estimates of explanatory
power. When parameters are estimated using data from a single
sample, the analysis procedures are designed to maximize the fit
of the model to the data. The maximization process capitalizes
on chance elements of the data. As a result, the model will not
fit the data from a new sample as well as it did the data from
the original sample. The loss of predictive power is known as
shrinkage. Methods of adjusting for shrinkage have been developed
to obtain more realistic estimates of the predictive power that
can be expected when a model is applied to a new data set. For
example, the shrunken R2 for regression and ω2, a comparable
statistic for ANOVA (Hays, 1963), allow for this inflation.
Jöreskog and Sörbom's (1981) adjusted GFI is an SEM analogue of
the shrunken R2. Raju, Bilgic, Edwards, and Fleer (1997, 1999)
reviewed and simulated the performance of a number of equations
for shrunken R2. In their simulation, shrinkage increased as the
predictive power of the model decreased, as the sample size
decreased, and/or as the number of predictors in the model
increased. These model components had more effect on shrinkage
than did the choice between alternative shrinkage equations.
These findings should generalize to other GLM analyses (e.g.,
ANOVA models). Thus, investigators should be especially concerned
about shrinkage when a model with many predictors yields moderate
to low predictive power in a small sample. Browne (2000) provided
a general discussion of shrinkage and the available methods of
adjusting for this capitalization on chance.

• The choice of ES should be appropriate to the modeling objective.
For example, in regression, the semipartial correlation expresses
PRE relative to the overall variance in the dependent variable.
Significance tests are based on the partial correlation, a
statistic that relates incremental PRE to the residual variance
([...], 1983, pp. 85-110). When the overall model accounts for a
large proportion of the criterion variance, the semipartial
correlation can be small even though the partial correlation is
large. The partial correlation is the basis for NHST. This
statistic relates the incremental variance accounted for by
adding the parameter to the residual variance for the overall
model. The semipartial correlation expresses the incremental
variance relative to the overall variance of the dependent
variable. For example, if a model accounted for 90% of the
variance in the dependent variable, a parameter that accounted
for 10% of the residual variance would only account for 1% of the
total variance. The model is being constructed to explain the
total variance, not the residual variance. The semipartial
correlation indicates the explanatory power of the model in this
frame of reference and so will be more appropriate than the
partial correlation for most modeling situations.
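
Two of the points in the list above lend themselves to short
calculations: shrinkage and the contrast between partial and
semipartial indices. The sketch below is illustrative only. It uses
the widely taught adjusted R2 formula as one of several possible
shrinkage equations (an assumption, not necessarily the equation
preferred by the studies cited above), and the function names and
input values are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """One common shrinkage adjustment: n cases, p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def shrinkage(r2, n, p):
    """Amount by which the sample R2 is reduced by the adjustment."""
    return r2 - adjusted_r2(r2, n, p)


def semipartial_r2(r2_full, r2_reduced):
    """Increment in R2 relative to the total variance."""
    return r2_full - r2_reduced


def partial_r2(r2_full, r2_reduced):
    """Increment in R2 relative to the variance left unexplained."""
    return (r2_full - r2_reduced) / (1 - r2_reduced)


# Shrinkage grows as R2 falls, as N falls, and as predictors are added.
print("baseline        :", round(shrinkage(0.20, n=50, p=10), 3))
print("higher R2       :", round(shrinkage(0.60, n=50, p=10), 3))
print("larger sample   :", round(shrinkage(0.20, n=500, p=10), 3))
print("fewer predictors:", round(shrinkage(0.20, n=50, p=2), 3))

# The example from the text: the existing model explains 90% of the
# variance and the added parameter explains 10% of what remains.
print("partial r2    :", round(partial_r2(0.91, 0.90), 3))
print("semipartial r2:", round(semipartial_r2(0.91, 0.90), 3))
```

The last two lines reproduce the contrast described in the text: the
same parameter accounts for 10% of the residual variance but only 1%
of the total variance.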


Interpretation is a problem for model appraisals based on
traditional indices for explanatory power. Problems arise because ES
and PRE indices are set in a statistical frame of reference. In each
case, raw data are transformed into standardized data. The advantage
of transforming the data is that ES values can be compared even when
different variables in the model have different raw score metrics. For
example, analysis might yield an ES represented by a point biserial
correlation between experimental status (i.e., experimental or control
group) and the dependent variable of rpb = .30. The associated PRE
statistic would describe the relationship as accounting for 9% of the
variance in the dependent variable. Cohen's (1988) criteria would
classify the association as medium in size. These statements could be
applied whether the experiment was a training program designed to
increase push-up scores, a clinical intervention to reduce depression,
or a new method of teaching designed to improve algebra test scores.

The disadvantage of ES-based model appraisal derives from the
transformation of the raw data. The standardization must be reversed
to express findings in the behavioral units relevant to the original
research problem. Discussions that contrast statistical significance
with practical or theoretical significance highlight this necessity
(e.g., Jacobson, Roberts, Berns, & McGlinchey, 1999; Thompson, 2002).
Options for doing so include the binomial effect size display
(Rosenthal & Rubin, [...]), the common language ES (CL; McGraw & Wong,
1992), the receiver operating characteristic curve (Lett, Hanley, &
Smith, 1995; Swets, 1988), and the number needed to treat (Ebrahim,
2003). For example, CL is the probability that an observation selected
randomly from an experimental group will exceed an observation
selected randomly from a control group. The result has clear intuitive
meaning. Also, the difference between CL = 75% and CL = [...] [...]
allowance for uncertainty in the ES estimates. This allowance could
take the form of CIs expressed in a CL ES metric.
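
The translation from a standardized ES to these more intuitive
metrics can be sketched for the rpb = .30 example used earlier. The
code assumes equal group sizes, normally distributed scores, and equal
variances in the two groups. These assumptions are needed for the
conversion formulas and are not stated in the text.

```python
import math
from statistics import NormalDist


def d_from_r(r):
    """Convert a point biserial r to Cohen's d (equal group sizes)."""
    return 2 * r / math.sqrt(1 - r ** 2)


def common_language_es(d):
    """Probability that a random experimental score exceeds a random
    control score, assuming normal scores with equal variances."""
    return NormalDist().cdf(d / math.sqrt(2))


def binomial_effect_size_display(r):
    """Success rates implied by r when both margins are split 50/50."""
    return 0.5 + r / 2, 0.5 - r / 2


r_pb = 0.30
d = d_from_r(r_pb)
cl = common_language_es(d)
experimental_rate, control_rate = binomial_effect_size_display(r_pb)

print(f"d = {d:.3f}")
print(f"CL = {cl:.1%}")
print(f"BESD: {experimental_rate:.0%} vs. {control_rate:.0%}")
```

Expressed this way, an effect that accounts for 9% of the variance
corresponds to roughly a two in three chance that a randomly chosen
experimental participant outscores a randomly chosen control
participant.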

Consistent reporting of ES would support the growth of reliable
knowledge in behavioral research. Improvement in this aspect of
research practice would ensure that enough information was reported
to support meta-analysis of the cumulative body of evidence in a
field. Meta-analysis can model methodological and substantive
influences on ES. Several meta-analytic methods developed for this
purpose (Hedges & Olkin, 1985; Hunter & Schmidt, 1990; Rosenthal,
[...]) produce similar results (Schmidt & Hunter, 1998).

Meta-analysis generally is used to evaluate correlations or other
statistics that can be converted to correlations (Cooper & Hedges,
1994; Hedges & Olkin, 1985). However, meta-analytic methods can be
extended to standard deviations and standard errors of estimate (SEEs)
(Raudenbush & Bryk, 2002, chapter 7). These statistics should receive
increased attention in future meta-analyses. The variables that
predict ES in a meta-analysis are analogous to moderators in
traditional moderator analysis. Restriction of range and other factors
can produce the appearance that a moderator effect is present when it
really is not (Zedeck, 1971). Meta-analysis, too, can be influenced by
these factors. Extending meta-analysis to cover sampling variance
reduces the risk of incorrect inferences. With this extension,
meta-analysis can provide parameter estimates that are suitable for
SST. These estimates would move behavioral research toward risky
hypothesis tests that could provide the evidence needed to make strong
claims for a model.
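
The step from accumulated studies to a parameter value for SST can be
sketched with a very simple pooling procedure. The code below is a
basic fixed-effect average of Fisher z values weighted by N - 3. It is
only one of the meta-analytic methods cited above, it ignores
moderators and artifacts such as range restriction, and the study
results are hypothetical.

```python
import math
from statistics import NormalDist


def pool_correlations(studies, level=0.95):
    """Fixed-effect pooled correlation and CI from (r, n) pairs."""
    weights = [n - 3 for _, n in studies]
    z_values = [math.atanh(r) for r, _ in studies]
    z_bar = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    se = 1 / math.sqrt(sum(weights))
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    pooled = math.tanh(z_bar)
    interval = (math.tanh(z_bar - crit * se), math.tanh(z_bar + crit * se))
    return pooled, interval


# Hypothetical prior studies: (correlation, sample size).
prior_studies = [(0.25, 80), (0.32, 150), (0.28, 120)]
pooled_r, (ci_low, ci_high) = pool_correlations(prior_studies)
print(f"pooled r = {pooled_r:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The pooled value could then serve as the non-zero value k in a
subsequent SST, replacing the zero value assumed under NHST.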

Full realization of the potential value of meta-analysis may be
hampered by the appearance that meta-analysis is too complex for the
average researcher. This appearance is misleading because the basic
analysis procedures are no different than those for primary data
analysis (Rosenthal & DiMatteo, 2001). Special issues that pertain
to meta-analysis are described in Cooper & Hedges (1994).

The  Future  of  NHST  and  ES 

The preceding comments identify opportunities to improve on
current practices by reducing the emphasis on NHST and increasing the
emphasis on ES indices of model effectiveness. Consistent reporting of
CI would support movement toward SST by facilitating meta-analyses
that would provide the parameter estimates needed for SST. The result
would be stronger models based on cumulative evidence, coupled with
discussions that [...] findings. The development of models based on
formal analysis of the cumulative empirical evidence should foster
consensus on the evidence.

Movement toward the use of CL ES metrics is another factor that
should foster consensus. Developments in this area would promote a
better understanding of what the set of parameters in a model mean in
terms of the actual behaviors that are the true focus of model
building. The gap between abstract statistical indices and actual
behavior is clear and must be addressed in practice. Consensus on the
choice between alternative models will not result directly from these
changes in practice, but it is reasonable to hope that the bases for
arguments for different models will be communicated to practitioners
more clearly.

ES will be reported more consistently in the future. Melton's
(1962) editorial on significance tests is commonly cited as evidence
of the pressures that made NHST an important, sometimes critical,
requirement for publication. Similar pressures are mounting for ES
indicators that are reported sporadically at present (Finch et al.,
2001; [...], 1999). A growing number of journals have editorial
policies that require additional information (Fidler & Thompson,
2001). Previous experience suggests that change may be slow (cf. Finch
et al., 2001), but the increasing frequency of meta-analyses should
stimulate more consistent reporting. The work of researchers who do
not report ES, or who fail to provide enough information to compute
ES, will ultimately be excluded from the cumulative body of evidence.

These developments should foster improved inference about the
adequacy and utility of models. Interpretation may be improved by
combining progress toward clinical ES measures with the recommended
use of CI. These approaches could be combined to present findings
graphically in ways that convey their direct clinical or applied
meaning and utility. Graphical presentation that fosters better
communication of research findings is one index of the scientific
maturation of a field (Smith, Best, Stubbs, Archibald, & Roberson-Nay,
2002). Both trends should decrease the need for practitioners to apply
arbitrary statistical criteria when making judgments about the
behavioral implications of models.

The increased use of Bayesian statistics will also support
improved inference. Elements of Bayesian reasoning already are present
in some current analysis methods (e.g., HLM; cf. Raudenbush & Bryk,
2002). The frequency with which Bayesian reasoning is discussed in
the NHST debate may increase familiarity with this approach to
inference. The problem of how to specify prior probabilities is the
primary barrier to wider use of Bayesian models. Recent summaries of
multiple meta-analyses (Lipsey & Wilson, 1993; Meyer et al., 2001)
provide some leverage for this problem. These summaries yield an
empirically grounded a priori estimate of the distribution of ES for
behavioral research. Stein's paradox (Efron & Morris, 1977) can be
applied to invoke this distribution as a proxy for the true priors in
new research domains. Thinking along these lines would replace NHST
with more appropriate inferential thinking.
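
The idea of using an empirical ES distribution as a stand-in prior
can be illustrated with a simple normal shrinkage calculation. This is
a sketch under stated assumptions, not the James-Stein estimator
itself: the prior is treated as normal, the sampling variance of
Cohen's d uses a common large-sample approximation, and the prior mean
and standard deviation are hypothetical values rather than figures
taken from the summaries cited above.

```python
def variance_of_d(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two groups."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def shrink_toward_prior(observed, observed_var, prior_mean, prior_sd):
    """Posterior mean and variance for a normal prior and normal data."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / observed_var
    posterior_var = 1 / (prior_precision + data_precision)
    posterior_mean = posterior_var * (prior_precision * prior_mean
                                      + data_precision * observed)
    return posterior_mean, posterior_var


# Hypothetical prior distribution of ES in a research area.
prior_mean, prior_sd = 0.50, 0.30

# Hypothetical new study: d = 1.00 with 20 participants per group.
observed_d = 1.00
observed_var = variance_of_d(observed_d, 20, 20)

posterior_d, posterior_var = shrink_toward_prior(
    observed_d, observed_var, prior_mean, prior_sd)
print(f"observed d = {observed_d:.2f}, shrunken d = {posterior_d:.2f}")
```

A small study with an unusually large ES is pulled toward the values
typical of the field, and the pull weakens as the study's own sampling
variance decreases.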


One undercurrent in the NHST debate merits special mention to
close this topic. Data analysis should not be a ritual. Data analysis
is only one element in the overall process of empirically testing
hypotheses and models. Existing theory and prior research findings
should inform the process at all times. Judgment is needed at each
step in the process to produce designs and analyses that correctly
link data to research questions. Researchers routinely use judgment
in the complex activities of formulating hypotheses and developing
research designs (Kirk, 1996). The best overall statement as regards
NHST at present, therefore, appears to be this: Judgment should not be
suspended during the data analysis phase of research.


SEM  Appraisal  Methods 

Traditional model evaluation methods will persist until a
reasonable alternative is available. SEM practices are considered in
some detail here because they are the product of a quarter century of
developing an approach to model evaluation that minimizes reliance on
significance testing. Also, the increasing use of SEM in behavioral
research demonstrates the attractiveness of these methods.
Researchers should be motivated to learn new model appraisal methods
as part of the process of acquiring familiarity with this new
methodology.


General Appraisal Processes


SEM appraisals involve three general criteria. SEM analogues of
significance tests and explanatory power are coupled with indicators
of misfit between models and data. Significance tests play a minor
role in SEM appraisals. In this context, the confounding of ES and
sample size has been an explicit concern for 20 years (Hoelter, 1983).
As a result, the use of significance tests is limited primarily to the
assessment of individual parameters within models. Parameter
evaluation typically employs Jöreskog and Sörbom's (1981) t > 2.00
criterion. This criterion approximates the p < .05 standard commonly
used in NHST. This practice is primarily important in deciding model
details rather than in evaluating the model as a whole. Earlier
comments on NHST and SST apply to this element of SEM appraisal and
will not be repeated here.


SEM programs report more than 20 GFIs that describe the overall fit
of a model to the data. Classification schemes based on conceptual and
[...] have been developed (Arbuckle & Wothke, 1999; Tanaka, 1993).
However, simulation studies have shown that different GFIs are
correlated when compared across samples. The empirical pattern of
associations suggests two general GFI categories (Hu & Bentler, 1998).
One category contains SEM analogues of PRE indices. Cross-validation
indices fall within this category. The second GFI category consists
of measures that are analogous to SEE in regression analyses. The
empirical clustering of GFIs is one reason for current recommendations
that investigators report more than one GFI when evaluating SEMs. The
recommended practice is to report at least one index from the clusters
analogous to PRE and SEE (e.g., Bentler & Dudgeon, 1996; McDonald &
Ho, 2002).

There is not yet a strong consensus on the best PRE measure for
SEM. The RMSEA (Arbuckle & Wothke, 1999) has been recommended
(Bentler & Dudgeon, 1996; Hu & Bentler, 1998). RMSEA has a population
interpretation, so CIs can be computed. The population interpretation
of RMSEA can be used to test hypotheses about the fit of the model.
SEM programs often report a "p (close)" test that compares the
observed RMSEA with a null hypothesis of RMSEA = .05. Fabrigar et
al.'s (1999) recent recommendation that RMSEA should be used in the
evaluation of EFA may indicate movement toward a consensus.
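
For readers who want to see how RMSEA relates to the model chi-square,
the point estimate can be sketched as follows. The formula shown is
the commonly used sample estimate. Some programs divide by N rather
than N - 1, so this should be read as one convention rather than the
definition used by any particular package. The chi-square, df, and N
values are hypothetical.

```python
import math


def rmsea(chi_square, df, n):
    """Point estimate of RMSEA from a model chi-square (N - 1 version)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Hypothetical model: chi-square = 85.3 with 40 df in a sample of 300.
estimate = rmsea(85.3, 40, 300)
print(f"RMSEA = {estimate:.3f}")

# A chi-square below its df yields an estimate of zero.
print(f"RMSEA = {rmsea(30.0, 40, 300):.3f}")
```

Because the excess of chi-square over df is divided by the sample
size, RMSEA is less directly tied to N than the chi-square test
itself, which is part of its appeal as a descriptive index.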

The standardized root mean square residual (SRMR) is the
recommended SEM analogue of SEE (Bentler & Dudgeon, 1996; Hu &
Bentler, 1998; MacCallum & Austin, 2000). SRMR reflects the
standardized difference between observed covariances and the model
estimates of those covariances.
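
The SRMR calculation can be sketched directly from this description.
The code below follows one common definition, in which each residual
is divided by the product of the observed standard deviations and the
squared values are averaged over the unique elements of the matrix.
Programs differ in details such as the treatment of the diagonal, and
the matrices shown are hypothetical.

```python
import math


def srmr(observed, implied):
    """Standardized root mean square residual for two covariance
    matrices given as nested lists (lower triangle and diagonal)."""
    p = len(observed)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            scale = math.sqrt(observed[i][i] * observed[j][j])
            total += ((observed[i][j] - implied[i][j]) / scale) ** 2
    return math.sqrt(total / (p * (p + 1) / 2))


# Hypothetical observed and model-implied correlation matrices.
observed = [[1.00, 0.50, 0.40],
            [0.50, 1.00, 0.30],
            [0.40, 0.30, 1.00]]
implied = [[1.00, 0.45, 0.42],
           [0.45, 1.00, 0.33],
           [0.42, 0.33, 1.00]]

value = srmr(observed, implied)
print(f"SRMR = {value:.4f}")
```

The index is an average, so a single large residual can be masked by
many small ones.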


Simulation studies indicate that RMSEA and SRMR provide different
types of information about models. In these simulations, the
experimenter defines the true population model. Models with known
errors (i.e., omitted parameters, added parameters) then are fitted to
the data. GFI measures are evaluated by determining how sensitive
they are to the errors. In such simulations, SRMR has been sensitive
to errors in the path model component of SEMs (Hu & Bentler, 1998).
RMSEA has been sensitive to errors in the measurement model (Fan,
Thompson, & Wang, 1999; Hu & Bentler, 1998). Neither the number of
factors nor the number of indicator variables affects RMSEA when the
model is correctly specified (Cheung & Rensvold, 2002).

Users should be aware that RMSEA and SRMR can yield different
conclusions about a model. This is not surprising given that these
indices provide different types of information. Browne, MacCallum,
Kim, Andersen, and Glaser (2002) describe the conditions that produce
this disparity. When conflicts occur, trade-offs between these
criteria may be required. For example, the available simulation
evidence might be used as a guide. If so, SRMR would be given greater
weight when evaluating path models. RMSEA would be given greater
weight when evaluating measurement models. This approach to weighting
the criteria assumes that modeling must lead to the adoption of a
single model. One alternative would be to treat the criteria as
equivalent and conclude that the study did not make it possible to
choose between the model with the smallest RMSEA and the model with
the smallest SRMR. Retaining more than one model may be preferable to
premature adoption of a single alternative as "the" model.




SEM  Appraisal  Issues 

As  mentioned  at  the  beginning  of  this  section,  SEM  appraisal 
practices  raise  issues  that  are  not  always  evident  in  other  types  of 
analysis. As a consequence, SEM appraisal does not begin and end with
the  examination  of  one  or  two  statistical  indicators  for  model 
adequacy.  Satisfactory  assessment  of  a  model  must  also  consider  other 
issues.  Some  important  general  topics  in  model  evaluation  are 
examined  here  under  the  heading  of  appraisal  issues. 

Steps in Modeling. One important model appraisal issue is
highlighted by recommendations that measurement models be defined and
evaluated before estimating path models (Anderson & Gerbing, 1988).
The initial proposal of this two-step procedure stimulated debate on
the strengths and weaknesses of the approach (Anderson & Gerbing,
1992; Fornell & Yi, 1992). McDonald and Ho (2002) raised the issue
again and demonstrated that the fit of the two models can be quite
different. That demonstration should spark renewed interest in the
topic.


Good overall fit for a model means that it reproduces at least
some parts of the data well. However, overall fit can conceal
significant misfit in specific elements of the model when a few large
errors are averaged with a number of much smaller errors. If the
large errors are scattered throughout the covariance or correlation
matrix being analyzed, there may be no problem. However, there is no
guarantee that the errors will not be focused in specific areas of the
model. Inaccuracies in the measurement model do not have the same
implications for theory as do inaccuracies in the path model. A weak
measurement model means that the current model does not satisfy one of
several conditions that must be met to obtain meaningful tests of
substantive hypotheses (Meehl, 1990a). The hypothesized relationship
still might be demonstrated by refining the measurements or by
substituting other measurement procedures if available. In fact,
demonstrating that the same associations and lawful relationships
between theoretical constructs can be derived using different
measurement models is one hallmark of reliable knowledge (Ziman,
1978).

This point is not always appreciated in behavioral research.
[...] (2002) argues that research paradigms often come to be equated
with the theoretical constructs they are intended to measure. A
general construct thereby is reduced to a specific set of operational
definitions, including a specific measurement model. When different
researchers develop different paradigms to study the same construct,
each paradigm can become the center of a research program. Different
programs then proceed in parallel rather than being directly compared.
The collected set of paradigms then may be combined to represent the
theoretical construct as a syndrome. Separate measurement and path
models would help to clarify the role of measurement paradigms: Do
different paradigms produce equivalent estimates of the relationships
between theoretical constructs? If so, progress is being made toward
reliable knowledge. In this context, measurement methods are
auxiliary models that must be reasonably accurate in order to test
theoretical assertions (Meehl, 1990a).

Measurement evaluation is implicit in current practices outside
the SEM realm. Regression and ANOVA methods typically define
predictor and criterion measures prior to analysis. Consideration of
the measurement model may be limited to reference to previous studies
that established the measurement adequacy of the scales. Direct
demonstration of measurement adequacy in the present sample is not
the norm. More consistent attention to this issue would
reduce the risk of inappropriate generalization. This risk is a
neglected dilemma for behavioral researchers (Blalock, 1982). Neglect
renders the dilemma invisible, but does not eliminate it.

Considered in this context, any process of principled argument
must focus on measurement issues at some point. Routine use of a
two-step procedure can organize the empirical evidence bearing on
this part of the argument. For this reason, it seems likely that
practice ultimately will move in the direction of separating
measurement model assessments (i.e., scale construction) from path
model assessments (i.e., tests of substantive hypotheses). A review
of Anderson and Gerbing's (1988) arguments, subsequent debate
(Anderson & Gerbing, 1992; Fornell & Yi, 1992), and McDonald and Ho's
(2002) recent exposition of the issues will provide researchers with a
firm basis for determining how critical this issue is to any
particular research problem.

Effects of Measurement Error. SEM evaluations also direct
attention to the effects of measurement error. SEM programs provide R²
values for the latent traits that are dependent variables in the path
model. These values are likely to be stronger than those found in
ordinary regression. This difference can be attributed to removing
the effects of measurement error (Bollen, 1989). In effect, SEM
includes corrections for the attenuation of associations that results
from measurement error (cf. Nunnally & Bernstein, 1994). The
resulting estimates may be closer to true population values than are
attenuated estimates, but this apparent benefit should be viewed
with caution (Bedeian, Day, & Kelloway, 1997). This potential
benefit of SEM analyses is not always evident because R² for the path
model does not ordinarily receive as much attention as it would if the
model had been created using regression techniques. These statistics
ordinarily play little part in SEM model evaluations. For example,
the change in R² resulting from dropping a parameter ordinarily is not
a consideration. The potential value of greater attention to these
statistics [illegible] because the ES indicators of strength of
association [illegible] when they [illegible]. Greater attention to
this information in the future will provide a [illegible] the utility
of the path model as an SEM [illegible].
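The correction for attenuation mentioned above can be illustrated with the classical formula from test theory, in which the observed correlation is divided by the square root of the product of the two reliabilities. The Python sketch below is only an illustration: the function name and the numerical values are hypothetical, and SEM programs reach a comparable result by estimating the measurement model rather than by applying this formula directly.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Classical correction for attenuation.

    Estimates the correlation between true scores from the observed
    correlation r_xy and the reliabilities of the two measures.
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: observed r = .30, reliabilities of .70 and .80.
observed_r = 0.30
corrected_r = disattenuate(observed_r, 0.70, 0.80)
print(round(corrected_r, 3))
print(round(corrected_r ** 2 - observed_r ** 2, 3))  # gain in shared variance
```

With these hypothetical inputs the corrected correlation is about .40 rather than .30, which illustrates why R² values for latent traits tend to exceed those obtained from observed scores.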
Search for Areas of Misfit. The appraisal of model misfit should
include a search for atypical data points. In some cases, a few
unusual observations, known as outlier and/or influential data points,
heavily influence model fit. Roughly speaking, an outlier data point
is  an  observation  with  an  exceptionally  large  residual.  The 
exceptional  residual  inflates  the  cumulative  error  variance  of  the 
model.  The  inflation  means  the  model's  explanatory  power/goodness  of 
fit  is  underestimated.  Influential  data  points  markedly  alter 
parameter  estimates  in  the  model.  For  example,  an  influential  data 
point  is  one  that  changes  the  regression  slopes  in  multiple 
regression.  The  influential  data  point  is  not  necessarily  an  outlier 
because  distorted  parameter  values  make  the  predictions  reasonably 
accurate  in  some  cases.  However,  the  model  parameters  are  less 
accurate  than  they  could  be  for  other  data  points.  The  overall 
accuracy  of  the  model  is  likely  to  decline.  Both  outlier  and 
influential  data  points  can  lead  to  models  that  do  not  accurately 
describe  the  population  under  investigation.  In  the  context  of 
behavioral  modeling,  the  resulting  model  may  lead  to  mistaken 
conclusions  about  the  causes  and  consequences  of  the  behavior  of 
interest.
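The distinction between outliers and influential data points can be made concrete with a small sketch. The code below assumes a one-predictor regression and computes three common case-level diagnostics: leverage, the internally studentized residual, and Cook's distance. Cook's distance is a widely used influence measure that is not named in the text above, and the data values are hypothetical; the sketch is meant to show the logic, not to replace the diagnostics in a regression package.

```python
import math

def regression_diagnostics(x, y):
    """Fit y = b0 + b1*x by least squares and return case-level diagnostics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    p = 2  # estimated coefficients: intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)
    cases = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        # Internally studentized residual: residual scaled by its standard error.
        student = e / math.sqrt(mse * (1 - leverage))
        # Cook's distance combines residual size and leverage.
        cook = (student ** 2 / p) * leverage / (1 - leverage)
        cases.append({"residual": e, "leverage": leverage,
                      "student": student, "cook": cook})
    return cases

# Hypothetical data: eight cases near the line y = 2x plus one atypical case.
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 10.0]
cases = regression_diagnostics(x, y)
for i, c in enumerate(cases):
    print(i, round(c["residual"], 2), round(c["leverage"], 2), round(c["cook"], 2))
```

In this hypothetical data set the final case has by far the largest Cook's distance, yet two other cases have larger raw residuals. The atypical case pulls the fitted line toward itself, so its own residual is modest while the residuals of the well-behaved cases are inflated.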

Outlier  and  influential  data  points  are  not  fatal  problems  in 
modeling.  Diagnostic  procedures  are  available  to  identify  influential 
and outlier cases (cf. Belsley, Kuh, & Welsch, 1980; Stevens, 1984).
Chatterjee  and  Yilmaz  (1992)  review  the  use  of  these  methods  and 
provide  an  additional  example  of  their  application.  These  methods  are 
available  in  many  regression  programs,  but  are  not  generally  available 
in  SEM  or  HLM  programs.  In  those  cases,  preliminary  regression 
analyses  can  help  to  identify  exceptional  data  points.  The  sources 
that  describe  the  bases  for  the  diagnostic  indicators  provide  general 
guidelines  for  interpreting  the  statistics.  One  limitation  of  the 
available  indicators  is  that  they  may  be  insensitive  to  situations  in 
which groups of data points affect the model (Belsley et al., 1980).
Robust  regression  produces  accurate  models  even  when  the  proportion  of 
contaminating  data  points  is  large  (Rousseeuw  &  Leroy,  1987).  This 
method  should  be  considered  for  data  screening  when  it  is  available, 
but  the  technique  is  not  widely  available  at  present.  Draper  and 
Smith  (1998)  describe  an  iterative  approach  that  addresses  this 
problem  without  the  need  for  specialized  analysis  packages.  The 
prediction sum of squares (PRESS) approach is potentially
time-consuming because it is iterative, but the effort may well be
worthwhile.
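The iterative character of the PRESS approach can be sketched as follows. Each case is omitted in turn, the model is refitted to the remaining cases, and the omitted case is predicted from that fit. The sketch assumes a one-predictor regression and hypothetical data. It also shows a well-known shortcut for least squares models, in which the same quantity is obtained from a single fit by dividing each residual by one minus its leverage; the function names are illustrative only.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - b1 * mean_x, b1

def press_by_refitting(x, y):
    """PRESS computed literally: refit the model once per omitted case."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total

def press_by_leverage(x, y):
    """PRESS from a single fit, using deleted residual = e / (1 - leverage)."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        total += ((yi - (b0 + b1 * xi)) / (1 - leverage)) ** 2
    return total

# Hypothetical data (lists, so that cases can be dropped by slicing).
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 10.0]
press_refit = press_by_refitting(x, y)
press_short = press_by_leverage(x, y)
print(round(press_refit, 3), round(press_short, 3))
```

The two routes give the same value apart from rounding error, so the refitting loop is needed only when the model is not an ordinary least squares regression.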

Outlier  and  influential  data  points  may  even  have  positive 
effects  in  modeling.  These  exceptional  data  points  can  indicate  that 
the  data  include  cases  that  represent  two  or  more  distinct  populations 
(Barnett & Lewis, 1994). If so, separate models can be constructed
for  each  population  once  appropriate  indicators  of  group  membership 
have  been  identified. 

Justification for Model Amendments. The model defined at the
outset of a research project seldom is wholly satisfactory. The
appraisal typically identifies weaknesses that cannot be attributed
entirely to chance or to exceptional data points. Investigators then
must choose between amending the model and accepting it as "good
enough."

The "good enough" choice should not be neglected (Serlin &
Lapsley, 1985). This option is more likely to be considered in SEM
modeling than in other areas. SEM practices set stopping rules in
terms of GFI criteria. The most common cutoff for acceptable fit is
GFI = .900 (Bentler & Bonett, 1980), but higher standards (i.e., GFI >
.950) have been suggested recently (Hu & Bentler, 1999). In either
case, less than perfect fit is considered acceptable. In practice,
the standards for accepting a model as adequate are somewhat lower.
Optimistic interpretation of fit indices is common (Bentler & Dudgeon,
1996). Care must be taken to ensure that the criterion for "good
enough"  is  not  set  so  low  that  it  impedes  the  search  for  improvements 
in  mediocre  existing  models. 

Post  hoc  model  modifications  take  two  forms.  The  most  common  is 
the  addition  of  parameters  to  improve  the  predictive  accuracy  of  the 
model.  Additions  are  philosophically  defensible  (Meehl,  1990a),  but 
the  modification  process  must  be  sensitive  to  the  risk  of  capitalizing 
on  chance.  For  example,  in  SEM,  a  search  through  all  the  constrained 
parameters  is  likely  to  capitalize  on  chance  (MacCallum,  Roznowski,  & 
Necowitz,  1992).  The  same  problem  arises  in  regression  (Thompson, 
1995).  Decisions  regarding  model  modifications  should  be  sensitive  to 
the  effects  of  chance.  The  basic  approach  is  to  set  a  more  extreme 
significance  standard  (e.g.,  p  <  .01  rather  than  p  <  .05).  Methods  of 
testing  post  hoc  contrasts  in  ANOVA  may  be  the  most  familiar  example 
of this approach (Winer et al., 1991). Keselman et al. (1999) and
Seaman et al. (1991) compare several approaches that could be used in
regression. Green, Thompson, and Poirier (2001) demonstrate the
utility of this approach in SEM.
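One simple way to set a more extreme significance standard is to divide the nominal alpha by the number of candidate modifications examined. This is the Bonferroni adjustment, chosen here only because it is easy to show; the text above does not single out a particular method, and the sources cited compare several alternatives. The parameter labels and p values below are hypothetical.

```python
def flag_modifications(p_values, family_alpha=0.05):
    """Return the candidate parameters that pass a Bonferroni-adjusted test.

    p_values maps a label for each candidate parameter to its p value.
    """
    threshold = family_alpha / len(p_values)
    return sorted(name for name, p in p_values.items() if p < threshold)

# Hypothetical post hoc search over five constrained parameters.
candidates = {
    "path A to C": 0.030,
    "path B to D": 0.012,
    "path A to D": 0.004,
    "error covariance e1 with e2": 0.200,
    "path C to E": 0.0006,
}

unadjusted = sorted(name for name, p in candidates.items() if p < 0.05)
adjusted = flag_modifications(candidates)
print(len(unadjusted), "candidates pass p < .05")
print(len(adjusted), "candidates pass the adjusted standard:", adjusted)
```

In this hypothetical search four of the five candidates pass the conventional standard, but only two survive once the number of tests is taken into account.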

Model  amendment  should  not  rely  solely  on  significance  tests. 
Modifications should also consider ES and theory. Kaplan (1990a,
1990b) proposed combining ES and significance (i.e., modification
index)  to  determine  when  to  add  parameters  to  SEM.  Experts  who 
commented  on  this  proposal  noted  that  parameters  should  not  be  added 
without  sound  theoretical  justification  (Bollen,  1990;  Steiger,  1990). 
The  rationale  for  that  assertion  applies  to  all  types  of  models. 

Models  can  also  be  modified  by  deleting  parameters.  Deletion 
fixes  a  parameter  that  had  a  non-zero  value  in  the  original  model  at 
zero in the revised model. In SEM, parameters with t < 2.00 often are
deleted (Jöreskog & Sörbom, 1981). Parameter deletion procedures can
be implemented easily in regression. Backward stepwise regression
performs these modifications automatically by removing the predictor
with the least predictive value and then estimating a new regression
with the remaining variables. This process is repeated until all
remaining parameters are acceptable (e.g., p < .05). Backward
stepwise regression thus provides an analogous model amendment
procedure for regression models.

Deletions  reduce  the  number  of  parameters  in  a  model.  The  result 
is greater parsimony (i.e., a model with fewer parameters; cf. Popper,
1959).  However,  the  search  for  simplicity  should  apply  the  same 
principles  used  when  deciding  whether  to  add  parameters.  The  effect 
on  predictive  power  and  the  implications  for  theory  should  be 
considered.  The  effects  of  chance  also  should  be  considered.  When 
examining  a  set  of  parameters,  one  or  more  of  the  parameters  can 
appear  to  be  no  greater  than  zero  by  chance.  Allowances  should  be 
made  for  this  risk  just  as  one  would  allow  for  chance  effects  when 
adding  a  parameter.  This  problem  does  not  appear  to  have  been 
addressed  in  the  literature,  but  it  is  likely  that  the  methods  used  to 
avoid improper addition can be adapted to avoid incorrect deletions.

The preceding sketch of model modifications points to two general
principles. First, multiple criteria should be used when deciding
whether to amend a model. The criteria include statistical
significance, ES, and theory. Theory must be emphasized to avoid
letting the statistical tail wag the theoretical dog. Second, the
same criteria apply to additions to and deletions from the model.
However, significance tests should focus on Type II error (vs. Type I
error) when considering deletions.

Justifying Claims for Model Generality. Fitting a model to data
yields  a  set  of  equations.  The  parameter  values  in  the  equations 
optimize  the  fit  of  the  model  in  the  specific  data  set.  Optimization 
is  influenced  by  the  effects  of  chance  on  the  pattern  of  covariation 
in  the  data.  Optimization  also  may  be  affected  by  the  fact  that  the 
data  were  sampled  from  a  specific  population.  Generalization  tests 
explore  the  effects  of  chance  and  population  differences  on  model 
structure.


Generalization  is  always  an  issue  in  behavioral  research 
(Blalock, 1982). In military research, one might ask whether the same
model  applies  to  men  and  women,  to  different  ethnic  groups,  to 
different  occupations,  and/or  to  different  military  services.  For 
example,  does  general  intelligence  (i.e.,  psychometric  'g')  predict 
job  performance  equally  well  for  all  military  occupations? 

Generalization  encompasses  cross-validation  and  moderator 
analysis.  Cross-validation  applies  a  model  developed  in  a  sample 
drawn  from  a  particular  population  to  a  different  sample  from  the  same 
population.  Moderator  analysis  compares  models  across  samples  drawn 
from different populations and/or situations. In both cases, the
question is whether the model varies substantially from one sample to
another.

Browne (2000) provides an overview of cross-validation issues for
various types of analysis. His review describes model development
as consisting of calibration and validation phases. However, current
guidelines use the term "validity" to refer to the appropriateness of
the interpretations of test scores (APA, 1985). By extension, model
validity would refer to the appropriateness of the interpretations of
model  parameters.  This  aspect  of  modeling  can  be  realigned  by 
characterizing  the  examination  of  sources  of  parameter  variation  as 
generalization  tests.  Instead  of  linking  parameter  variation  to  model 
interpretation  (i.e.,  validity),  the  realignment  emphasizes  the 
legitimate  scope  of  application  of  the  model.  This  shift  highlights 
the  affinity  between  parameter  variation  and  Cronbach  et  al.'s  (1972) 
generalizability  approach  to  test  scores. 

The  effects  of  optimizing  the  fit  of  the  model  to  a  single  sample 
can  be  estimated  directly  from  results  obtained  in  fitting  the  initial 
model. The shrunken R² printed out in many regression and GLM programs
is the most familiar example of this approach. A number of indices to
estimate the population accuracy of a model have been developed for
regression (Raju, Bilgic, Edwards, & Fleer, 1997). Users should be
aware that some formulae estimate the population multiple correlation
for the model; other formulae indicate the R² that would be expected
when applying the model to a new sample of data from that population.
Sampling variation specific to the new sample would affect performance
in the latter case. Raju, Bilgic, Edwards, and Fleer (1999) conducted
a simulation to compare a number of widely used formulae. The
formulae performed well when the sample size was at least moderately
large (i.e., N ≥ 100 or so). The expected cross-validation index
(ECVI; Browne & Cudeck, 1989) is the analogous SEM index.
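The difference between the two kinds of formulae can be shown with one example of each. The sketch below uses the familiar adjusted R² as an estimate of the population squared multiple correlation, and a formula commonly attributed to Stein and to Darlington as an estimate of the R² expected in a new sample. Both choices are assumptions made for illustration; they are not necessarily the formulae favored in the sources cited, and the input values are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Estimate of the population squared multiple correlation.

    r2: sample R-squared, n: sample size, k: number of predictors.
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def cross_validity_r2(r2, n, k):
    """Estimate of the R-squared expected when the sample equation is
    applied to a new sample (formula attributed to Stein and Darlington)."""
    shrink = ((n - 1) / (n - k - 1)) * ((n - 2) / (n - k - 2)) * ((n + 1) / n)
    return 1 - shrink * (1 - r2)

# Hypothetical study: R-squared of .30 with 5 predictors and 100 cases.
sample_r2, n_cases, n_predictors = 0.30, 100, 5
population_estimate = adjusted_r2(sample_r2, n_cases, n_predictors)
new_sample_estimate = cross_validity_r2(sample_r2, n_cases, n_predictors)
print(round(population_estimate, 3), round(new_sample_estimate, 3))
```

The estimate for a new sample is lower than the estimate for the population because the sample regression weights, not the population weights, are carried forward to the new data.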

Equivalent  Models.  Principled  argument  is  most  productive  when 
it  compares  competing  models.  Unfortunately,  model  comparison  is  not 
the  norm  in  behavioral  research  (Katzko,  2002).  As  a  result, 
behavioral  modeling  is  affected  by  confirmation  bias  and  insensitivity 
to  the  existence  of  equivalent  models. 

Confirmation  bias  is  a  prejudice  in  favor  of  the  model  under 
evaluation (MacCallum & Austin, 2000). Symptoms of bias include
overly  positive  evaluations  of  model  fit  and  a  "...  routine  reluctance 
to  consider  alternative  explanations  of  the  data"  (p.  213).  MacCallum 
and  Austin  (2000)  recommend  using  strategies  that  provide  for 
examination  of  alternative  models,  including  a  priori  specification  of 
multiple  models.  Based  on  a  review  of  recent  modeling  literature, 
these authors suggest that this approach is followed about half of the
time.


A  search  for  alternative  models  is  likely  to  identify  equivalent 
models.  Two  models  are  equivalent  if  they  are  of  equal  complexity  and 
fit  the  data  equally  well.  Models  have  equal  complexity  if  they  have 
the same number of parameters. MacCallum, Wegener, Uchino, and
Fabrigar (1993) found equivalent models in 46 of 53 studies they
examined.  The  median  number  of  equivalent  models  was  between  12  and 
21,  depending  on  the  research  area.  The  model  differences  were  not 
trivial from a theoretical perspective. Many alternative models had
very different substantive interpretations than the model adopted in
the original study.
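The simplest instance of model equivalence involves two variables. A model in which X causes Y and a model in which Y causes X have the same number of parameters and fit the data equally well, although their substantive interpretations are opposite. The sketch below, which uses hypothetical data and an illustrative function name, shows that the two regressions produce identical R² values.

```python
def fit_simple(predictor, criterion):
    """Return the least squares slope and R-squared for one predictor."""
    n = len(predictor)
    mean_p = sum(predictor) / n
    mean_c = sum(criterion) / n
    spp = sum((p - mean_p) ** 2 for p in predictor)
    scc = sum((c - mean_c) ** 2 for c in criterion)
    spc = sum((p - mean_p) * (c - mean_c) for p, c in zip(predictor, criterion))
    slope = spc / spp
    sse = sum((c - (mean_c + slope * (p - mean_p))) ** 2
              for p, c in zip(predictor, criterion))
    return slope, 1 - sse / scc

# Hypothetical scores on two variables.
x = [2, 4, 5, 7, 9, 10]
y = [1, 3, 6, 5, 9, 8]

slope_xy, r2_x_causes_y = fit_simple(x, y)   # model 1: X -> Y
slope_yx, r2_y_causes_x = fit_simple(y, x)   # model 2: Y -> X
print(round(r2_x_causes_y, 4), round(r2_y_causes_x, 4))
```

The slopes differ, but the fit does not, so the data alone cannot choose between the two causal directions. Larger models multiply the opportunities for reversals of this kind.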

MacCallum  et  al.'s  (1993)  review  understates  the  magnitude  of  the 
problem.  That  review  used  methods  developed  by  Stelzl  (1986)  and 
extended  by  Lee  and  Hershberger  (1990)  to  identify  equivalent  models. 
Raykov  and  Penev  (1999)  recently  showed  that  those  approaches  are 
special  cases  of  more  general  conditions  for  model  equivalence.  A 
search  for  all  models  that  satisfy  these  general  conditions  would  be 
expected to increase the MacCallum et al. (1993) estimate of the
number of equivalent models per study.

Models  that  are  literally  equivalent  should  not  be  the  only 
concern  when  attempting  to  avoid  confirmation  bias.  Other  models  may 
fit the data nearly as well as the best model(s). The population
interpretations of some SEM indices (e.g., RMSEA, ECVI) make it clear
that a sample yields an estimate of the fit between the model and the
data. The true population value of the GFI is most likely to fall in
the range defined by the CI. Other models that are not literally
equivalent to the current model will have GFI values that fall within
the CI. These models should be considered along with any literally
equivalent model(s). Special attention should be given to models that
fit nearly as well even with fewer parameters than the current
sample-optimal model. A trade-off between model accuracy and the
number of parameters is the heart of the parsimony issue raised by
Mulaik et al. (1989).

The  statistical  toolbox  includes  search  methods  to  identify 
equivalent  models.  Some  regression  programs  offer  an  "all  possible 
subsets"  routine.  This  method  considers  all  possible  combinations  of 
the  available  predictors  within  limits  set  by  the  researcher.  For 
example,  models  might  be  restricted  to  combining  no  more  than  five  of 
eight  available  predictors.  Large  numbers  of  models  are  fitted  to  the 
data  even  with  these  restrictions.  It  often  will  be  the  case  that 
several models offer comparable explanatory power. Mallows's (1973)
Cp is a statistic that provides a parsimony index for regression. Cp
can be used to choose between alternatives (see Draper & Smith, 1998).
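The use of Cp to compare subsets can be sketched with the standard formula, Cp = SSE_p / s² - (n - 2p), where SSE_p is the error sum of squares for a subset model with p parameters (including the intercept) and s² is the mean square error of the full model. The error sums of squares below are hypothetical values standing in for the output of an all possible subsets routine; the predictor labels are placeholders.

```python
def mallows_cp(sse_subset, p_subset, mse_full, n):
    """Mallows's Cp for a subset model; p_subset counts the intercept."""
    return sse_subset / mse_full - (n - 2 * p_subset)

n_cases = 50
# Hypothetical results: label -> (error sum of squares, number of parameters).
subsets = {
    "x1": (150.0, 2),
    "x1 + x2": (94.0, 3),
    "x1 + x3": (120.0, 3),
    "x1 + x2 + x3": (92.0, 4),   # full model
}
sse_full, p_full = subsets["x1 + x2 + x3"]
mse_full = sse_full / (n_cases - p_full)

cp_values = {name: mallows_cp(sse, p, mse_full, n_cases)
             for name, (sse, p) in subsets.items()}
for name, cp in cp_values.items():
    print(name, "p =", subsets[name][1], "Cp =", round(cp, 2))
best = min(cp_values, key=cp_values.get)
```

Subsets with Cp close to p show little evidence of bias from omitted predictors. In this hypothetical case the two-predictor model "x1 + x2" has Cp equal to its parameter count and would be preferred to the full model on grounds of parsimony. For the full model, Cp always equals p by construction.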

The model search problem is more complex in SEM. The TETRAD
program (Glymour, Scheines, Spirtes, & Kelly, 1987; Spirtes, Glymour, &
Scheines, 1993) provides tools that permit the computer to search for
alternative models. The current version of the program permits the
researcher to specify constraints on the search in terms of background
knowledge. The background knowledge may include information about
whether the population SEM includes latent traits or correlated
errors, the time ordering of the variables, any established causal
relationships, and causal relationships that are known not to hold in
the population (Scheines, Spirtes, Glymour, Meek, & Richardson, 1998).
Scheines et al. (1998) describe the basic rationale for their approach
and its implementation in TETRAD II in a special issue of Multivariate
Behavioral Research, which includes commentary. The initial TETRAD
approach was compared with other search tools in a special issue of
Sociological Methods and Research (Spirtes, Scheines, & Glymour,
1990), also with attendant commentary.

The  future  may  see  tools  such  as  TETRAD  combined  with 
developments  such  as  Raykov  and  Penev's  (1999)  delineation  of  general 
conditions  for  identifying  equivalent  models.  Separately  or  in 
combination,  these  tools  make  it  possible  to  explore  the  problem  of 
specifying  equivalent  models  more  systematically.  Constructive 
applications  of  these  tools  could  address  limitations  of  existing 
research  (Bentler  &  Dudgeon,  1996;  MacCallum  &  Austin,  2000)  to  bring 
practice  in  line  with  recent  recommendations  for  the  proper  conduct 
and  reporting  of  SEMs  (Boomsma,  2000;  McDonald  &  Ho,  2002). 

Causal Interpretations. Model construction, appraisal, and
amendment  yield  one  or  more  sets  of  equations.  Each  set  represents  a 
plausible  alternative  model.  The  sets  of  equations  often  are  rendered 
visually  as  path  diagrams  that  include  unidirectional  arrows 
representing  hypothesized  causal  effects.  Thus,  the  mathematical 
statements  are  routinely  given  causal  interpretations  despite  cautions 
against this practice (Breckler, 1990; Roesch, 1999). These
interpretations  should  be  sensitive  to  two  challenges  that  are  related 
to  causal  inference. 

Incomplete  models  are  one  source  of  concern.  Model  parameters 
often  are  interpreted  as  indicating  the  amount  of  change  in  the 
dependent  variable  that  would  be  observed  if  a  predictor  were  changed 
by  one  unit.  This  interpretation  will  err  if  the  parameter  estimate 
is  biased.  Any  omitted  variable  produces  bias  if  it  has  a  causal 
influence  on  a  dependent  variable  and  is  correlated  with  one  or  more 
predictors in the model (James et al., 1982). The extreme case is a
spurious  relationship.  A  spurious  relationship  arises  when  omitted 
variables  are  the  entire  basis  for  the  association  between  a  model 
predictor  and  a  dependent  variable  (Kenny,  1979).  James  et  al.  (1982) 
discuss  methods  of  reducing  the  risk  of  omitted  variable  bias. 
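The size of omitted variable bias can be worked out directly in a small example. Suppose the dependent variable is generated without error from two correlated predictors. If the second predictor is omitted, the slope for the first absorbs part of the omitted effect: the bias equals the omitted variable's coefficient times the slope from regressing the omitted variable on the included predictor. The data and coefficient values below are hypothetical.

```python
def cross_products(a, b):
    """Sum of cross-products of deviations from the means."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))

B1, B2 = 1.0, 2.0                     # generating coefficients (hypothetical)
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]               # correlated with x1
y = [B1 * a + B2 * b for a, b in zip(x1, x2)]

# Slope for x1 when x2 is left out of the model.
slope_omitting_x2 = cross_products(x1, y) / cross_products(x1, x1)
# Bias predicted from the omitted coefficient and the x2-on-x1 slope.
predicted_bias = B2 * cross_products(x1, x2) / cross_products(x1, x1)

print(round(slope_omitting_x2, 3), round(B1 + predicted_bias, 3))
```

The misspecified model yields a slope of about 2.66 for x1 instead of the generating value of 1.0. A one-unit change in x1 would therefore be expected to change y by far less than the fitted coefficient suggests.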

Philosophical  issues  remain  even  after  omitted  variable  bias  has 
been  ruled  out.  The  general  problem  can  be  illustrated  by  considering 
the  interpretation  of  results  from  a  true  experiment.  In  this  case, 
it  is  impossible  to  directly  demonstrate  a  causal  effect  on  an 
individual.  This  demonstration  would  require  observing  the  person  as 
he  or  she  would  be  after  receiving  the  treatment  and  as  he  or  she 
would  be  without  the  treatment.  Only  one  of  these  two  conditions  can 
actually  be  observed,  so  a  causal  effect  cannot  be  established  for  any 
given  individual.  However,  in  a  true  experiment,  it  is  possible  to 
estimate the average treatment effect. The difference between the
treatment and control group means is an unbiased estimate of the
average of unit effects. Sobel (1996, 2000) discusses
these  issues  in  greater  detail.  In  the  context  of  typical  behavioral 
modeling  efforts,  the  advice  of  the  American  Psychological  Association 
Task  Force  on  Statistical  Inference  should  be  kept  in  mind:  "... 
especially  when  formulating  causal  questions  from  non-randomized  data, 
the underlying assumptions needed to justify any causal conclusions
should be carefully and explicitly argued ..." (Wilkinson & the Task
Force on Statistical Inference, 1999, p. 600).
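The logic of the average treatment effect can be illustrated with a small enumeration. The sketch below assumes a hypothetical table of potential outcomes for six units, something that can never be observed in practice, and then computes the difference between group means for every possible assignment of three units to treatment. Averaging over all assignments reproduces the true average of the unit effects, which is what unbiasedness means in this setting.

```python
from itertools import combinations

# Hypothetical potential outcomes: (outcome if untreated, outcome if treated).
potential = [(10, 12), (8, 13), (15, 15), (11, 9), (9, 14), (13, 16)]
n_units = len(potential)

# True average of the unit-level effects (unobservable in a real study).
true_ate = sum(treated - control for control, treated in potential) / n_units

estimates = []
for assignment in combinations(range(n_units), n_units // 2):
    treated_ids = set(assignment)
    treated_scores = [potential[i][1] for i in treated_ids]
    control_scores = [potential[i][0] for i in range(n_units)
                      if i not in treated_ids]
    estimates.append(sum(treated_scores) / len(treated_scores)
                     - sum(control_scores) / len(control_scores))

mean_estimate = sum(estimates) / len(estimates)
print(len(estimates), round(true_ate, 4), round(mean_estimate, 4))
print(round(min(estimates), 2), round(max(estimates), 2))
```

Any single randomization can miss the true value by a wide margin, as the range of the estimates shows, and no assignment reveals the effect for an individual unit. Without random assignment, even the average is not protected in this way.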

Graph theory provides tools to address causality in connection
with observational data (Glymour et al., 1987; Pearl, 1998; Spirtes et
al., 1993). This approach represents the measurement and path models
in an SEM as directed graphs. The directed graph includes the
familiar arrows from SEMs as hypothesized causal effects. The
directed graph can have testable implications such as disappearing
partial correlations and TETRAD equations (Glymour et al., 1987).
Determining whether the implied equations hold in the data tests the
plausibility of the model. This approach cannot prove that any given
model is correct, but it can rule out some competing models (Glymour
et al., 1987).
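Two of the testable implications mentioned above can be checked with simple arithmetic. A chain in which X affects Z only through Y implies that the partial correlation between X and Z, controlling for Y, is zero. A model in which a single latent trait accounts for four indicators implies that the tetrad differences among their correlations are zero. The sketch below uses hypothetical population values and ignores sampling error, so it shows what the models imply rather than how to test the implications in data.

```python
import math

def partial_corr(r_xz, r_xy, r_yz):
    """Correlation between X and Z with Y partialled out."""
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy ** 2) * (1 - r_yz ** 2))

# Chain X -> Y -> Z with standardized paths a and b implies r_xz = a * b.
a, b = 0.6, 0.5
chain_partial = partial_corr(a * b, a, b)
# If X also affected Z directly, r_xz would exceed a * b (hypothetical .50).
direct_partial = partial_corr(0.50, a, b)

# One latent trait with four indicators: correlation = product of loadings.
loadings = [0.8, 0.7, 0.6, 0.5]

def rho(i, j):
    return loadings[i] * loadings[j]

tetrad_differences = [
    rho(0, 1) * rho(2, 3) - rho(0, 2) * rho(1, 3),
    rho(0, 2) * rho(1, 3) - rho(0, 3) * rho(1, 2),
    rho(0, 1) * rho(2, 3) - rho(0, 3) * rho(1, 2),
]
print(round(chain_partial, 4), round(direct_partial, 4))
print([round(t, 10) for t in tetrad_differences])
```

A partial correlation or tetrad difference that departs reliably from zero in the data is evidence against the corresponding model. Values near zero are consistent with the model but do not establish it, because other graphs can imply the same constraints.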

HLM,  CLDV,  and  LGCA  Models 

The SEM evaluation issues also apply to the HLM, CLDV, and LGCA
models discussed previously. Given this overlap, there appear to be
opportunities to expand on current practice to obtain more complete
model assessments. For example, each approach to modeling produces
residuals that can be evaluated. However, standard analysis packages
may not include methods of identifying influential data points.
Preliminary regression analysis can serve this purpose (Raudenbush &
Bryk, 2002). The GFIs from SEM can be applied to other procedures
that yield χ² values as indicators of model fit. Thus, both the PRE
and misfit indices could be applied to other areas of study. Some
movement in this direction is already taking place. For example, it
has been suggested that the explanatory power of models can be
expressed in terms of the proportion of the null model χ² explained by
a substantive model (Agresti, 1996; Long, 1997). Attention also has
been given to examining residuals (Long, 1997).
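The proportion of the null model χ² explained by a substantive model is a simple ratio, and it has the same form as the normed fit index of Bentler and Bonett (1980). The sketch below uses hypothetical χ² values, and the function name is illustrative.

```python
def proportion_chi2_explained(chi2_null, chi2_model):
    """Share of the null model chi-square removed by the substantive model."""
    if chi2_null <= 0:
        raise ValueError("the null model chi-square must be positive")
    return (chi2_null - chi2_model) / chi2_null

# Hypothetical values: null model chi-square of 250, substantive model 40.
explained = proportion_chi2_explained(250.0, 40.0)
print(round(explained, 2))
```

With these values the substantive model accounts for 84% of the misfit of the null model, which would fall short of the .900 standard described earlier for SEM.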

Despite  suggestions  to  the  contrary,  the  analysis  of  categorical 
variables  currently  emphasizes  significance  testing.  The  problem  of 
sparse data (i.e., many empty or nearly empty cells in a
cross-classification) is a contributing factor (Bartholomew &
Tzamourani, 1999; Collins, Fidler, Wugalter, & Long, 1993;
Langeheine, Pannekoek, & van de Pol, 1996). These cells can bias the
observed χ² upward. Collapsing cells is one means of reducing this
problem, but this approach discards some of the information in the
data. Bootstrap methods (cf. Efron, 1982) that avoid this loss are
the recommended means of generating probability distributions for
choosing between models.
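A minimal sketch of the bootstrap idea follows. It assumes a parametric bootstrap, in which new samples are generated from the cell probabilities implied by the model and the Pearson χ² is recomputed for each. To keep the example short, the model probabilities are treated as fixed and known; in practice they would be estimated, and the model would be refitted to each bootstrap sample. The counts, probabilities, and function names are hypothetical.

```python
import random

def pearson_chi2(counts, probs):
    """Pearson chi-square comparing observed counts with model probabilities."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

def bootstrap_p_value(counts, probs, n_boot=2000, seed=1):
    """Share of simulated samples whose chi-square is at least as large
    as the observed value (parametric bootstrap under the model)."""
    rng = random.Random(seed)
    n = sum(counts)
    cells = list(range(len(probs)))
    observed = pearson_chi2(counts, probs)
    exceed = 0
    for _ in range(n_boot):
        simulated = [0] * len(probs)
        for cell in rng.choices(cells, weights=probs, k=n):
            simulated[cell] += 1
        if pearson_chi2(simulated, probs) >= observed:
            exceed += 1
    # Adding one to numerator and denominator avoids a p value of zero.
    return observed, (exceed + 1) / (n_boot + 1)

# Hypothetical sparse table: two cells have expected counts below 2.
model_probs = [0.50, 0.30, 0.15, 0.04, 0.01]
observed_counts = [22, 10, 5, 2, 1]
observed_chi2, p_value = bootstrap_p_value(observed_counts, model_probs)
print(round(observed_chi2, 3), round(p_value, 3))
```

The simulated distribution replaces the theoretical χ² distribution, which is a poor approximation when expected counts are very small. All cells are retained, so no information is lost to collapsing.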

Model  Evaluation  and  MAGIC 

Model evaluation is critical to principled argument. NHST
provides a weak, often misleading, basis for model evaluation.
Movement toward SST is desirable. Meta-analysis can facilitate
movement toward SST if ES or the information required for computing ES
is reported consistently. However, significance testing arguably is a
weak model appraisal tool with limited applications. Increased use of
other  indicators  of  model  adequacy  can  be  expected  in  the  future. 
Movement  away  from  purely  statistical  summaries  toward  common  language 
indicators  of  ES  and  predictive  power  would  be  constructive,  but 
changes in this area are likely to lag behind changes in statistical
practices.

Changes  in  these  traditional  practices  are  likely  to  come  slowly 
until  alternative  methods  of  evaluation  are  available.  SEM  appraisal 
practices  provide  one  set  of  alternatives.  This  approach  emphasizes  a 
process  that  challenges  a  model  by  pointing  out  its  limitations  or  by 
suggesting alternative models. Arguments based on limitations
include:

• "The model is determined by influential data points and
outliers."

•  "The  model  capitalized  on  chance  in  stepwise  modifications." 

•  "The  model  includes  predictors  that  serve  no  useful  purpose." 

Arguments  based  on  alternative  models  include: 

•  "Other  models  account  for  the  data  as  well." 

•  "Other  models  are  more  parsimonious." 

Any  useful  model  must  have  statistically  significant  predictive 
power.  Using  NHST  as  the  basis  for  model  evaluation,  therefore, 
represents  the  application  of  a  minimum  standard  for  model  acceptance. 
SST  is  more  relevant  to  appraisal  when  models  progress  beyond  this 
minimum,  but  SST  results  apply  to  a  specific  model  as  it  is  currently 
formulated.  The  critical  appraisal  and  amendment  procedures  are  those 
that  counter  the  challenges  noted  above.  Methods  that  move  beyond 
significance  testing  are  needed  to  respond  to  those  challenges. 

Systematic  amendment  and  appraisal  processes  help  to  avoid  common 
weaknesses  in  the  modeling  process.  Confirmation  bias  is  a  critical 
problem  given  the  current  state  of  the  art.  Modeling  efforts  often 
focus  on  a  single  model.  The  search  for  alternative  models  is 
frequently  limited  to  adding  parameters  to  or  deleting  parameters  from 
a  base  model.  The  typical  result  is  a  final  model  that  differs 
trivially  from  the  initial  model.  Indeed,  the  modifications 
introduced  may  be  no  more  than  the  effects  of  chance  unless  special 
steps  are  taken  to  allow  for  the  number  of  significance  tests  involved 
in  the  modification  process.  The  GFI  or  PRE  for  the  model  is  likely 
to  be  interpreted  optimistically.  The  fact  that  subjectively 
plausible  ex  post  facto  explanations  can  be  offered  for  the  structure 
of  the  current  model  may  be  taken  as  evidence  of  its  credibility.  This 
practice  is  questionable  in  light  of  Armstrong  and  Soelberg's  (1968) 
demonstration  that  models  produced  by  random  data  can  be  given 
plausible  interpretations.  These  two  model  appraisal  tendencies  make 
it  likely  that  equivalent  models  will  be  neglected.  Models  implied  by 
alternative research paradigms are likely to be ignored completely
(Katzko, 2002). Careful attention to these issues in model appraisal
and  amendment  can  substantially  strengthen  current  practices  and 
promote  principled  argument. 

The  amendment  and  appraisal  process  directs  attention  to  the 
interplay  of  different  elements  of  MAGIC.  These  components  of  the 
model  construction  process  emphasize  articulation  and  credibility. 
Articulation  is  critical  in  defining  alternative  models  and  justifying 
modifications  to  existing  models.  Showing  that  some  explanations  for 
data  are  less  plausible  than  others  enhances  the  credibility  of  the 
better  models.  The  plausibility  of  a  model  is  increased  by  adjusting 
the  weight  given  to  magnitude  estimates  to  allow  for  chance  sampling 
effects and the effects of outlier/influential data points. The
adjusted  magnitude  estimates  then  can  be  weighed  against  other 
criteria  (e.g.,  parsimony).  Improved  model  appraisal  practices  are 
likely  to  reduce  the  initial  interest  value  of  a  model.  Sound 
practice  will  highlight  the  existence  of  equivalent  models,  the 
potential  for  capitalization  on  chance,  and  so  forth.  Initial 
assessments  must  weigh  these  facts  against  any  novel  or  intellectually 
intriguing element(s) in the new model. Interest in the model will
grow  if  it  meets  the  challenges  of  the  appraisal  process. 

The  current  state  of  the  art  poses  a  challenge.  The  power  of 
statistical  tools  for  fitting  and  refining  models  is  increasing.  This 
power can be used to sharpen the process of principled argument.
Stronger  arguments  will  be  provided  if  applications  of  the  tools  are 
appropriately  sensitive  to  concerns  such  as  outliers,  influential  data 
points,  the  risk  of  capitalizing  on  chance  when  performing  multiple 
significance  tests,  and  the  distinction  between  statistical 
significance  and  explanatory  power.  The  models  produced  by  applying 
those  tools  should  be  sensitive  to  the  need  for  caution  in  causal 
inferences  and  to  the  likelihood  that  equivalent  models  may  exist. 

The  challenge  arises  because  rigorous  incorporation  of  each  of  these 
elements  into  model  construction  is  not  a  matter  of  habit  for  most 
researchers.  In  fact,  careful  implementation  of  these  desiderata 
requires  substantial  time  and  effort.  Changing  these  practices  can  be 
difficult  even  for  highly  motivated  investigators.  However,  the 
alternative  is  an  increased  risk  of  "garbage  in,  garbage  out" 
behavioral  models.  Even  if  the  principled  argument  process  ultimately 
sorts  the  good  from  the  bad,  the  sorting  process  will  be  far  more 
efficient  if  each  study  is  as  strong  as  possible. 

Help  is  available  for  the  overwhelmed  investigator.  Recent 
recommendations  for  sound  statistical  practices  (Wilkinson  &  the  Task 
Force  on  Statistical  Inference,  1999)  and  modeling  (Boomsma,  2000; 
McDonald  &  Ho,  2002)  point  to  the  most  important  tools  to  support 
improved modeling. These articles could be abstracted to provide
checklists that help ensure that the appraisal problems noted here
are properly addressed. Implementing those
recommendations  consistently  will  make  the  model  construction  process 
more  challenging  to  the  theorist  and  to  the  data  analyst.  The  effort 
will be repaid by gains in model accuracy and credibility and
increased persuasiveness in the argument process. These steps turn
the process into the principled argument needed to generate reliable
knowledge.


Searching  for  New  Perspectives 

Even  the  most  conscientious  application  of  the  methods  described 
in  the  preceding  section  may  not  produce  a  satisfactory  model  of 
behavior.  The  range  of  models  that  can  be  considered  is  limited  by 
the  variables  that  are  available  for  inclusion  in  the  models.  The 
range  may  be  limited  by  a  commitment  to  a  given  theoretical  framework. 
In  fact,  research  often  employs  paradigms  that  cannot  be  compared;  the 
result  is  behavioral  science  encompassing  several  explanations  that 
are  treated  as  equivalent  yet  mutually  exclusive  accounts  (Katzko, 
2002).  If  each  model  has  adherents,  parallel  explorations  of 
alternative  models  can  generate  a  range  of  useful  insights.  However, 
parallel  research  programs  are  more  likely  to  divide  the  research 
community  than  to  yield  a  consensus.  From  the  perspective  of  this 
chapter,  consensus  is  a  necessary  component  of  reliable  knowledge 
(Ziman, 1978). If consensus cannot be reached, two or more different
predictions  could  be  made  for  the  same  event.  In  such  a  case, 
additional  work  is  needed  to  determine  which  prediction  is  correct. 
This  uncertainty  can  be  resolved  in  four  ways.  First,  one  paradigm 
can  be  adopted  as  correct  and  the  others  discarded.  Second,  the 
paradigms  can  be  shown  to  be  different  methods  of  operationalizing  the 
same construct(s). The paradigms now become auxiliary models that
demonstrate  convergence  of  methods.  Third,  the  paradigms  can  be 
combined  to  provide  a  more  complete  model.  In  many  instances,  this 
step  will  be  necessary  to  replace  models  that  merely  provide 
statistically  significant  prediction  with  a  model  that  provides  a  high 
level  of  predictive  accuracy.  Fourth,  boundary  conditions  can  be 
defined  that  determine  when  each  paradigm  is  relevant  to  behavior. 
Given  these  alternatives,  the  isolated  study  of  individual  paradigms 
obviously  can  be  constructive.  However,  research  will  not  yield 
reliable  knowledge  as  defined  by  Ziman  (1978)  until  the  paradigms  are 
considered  jointly.  Direct  comparisons  are  fundamental  to  deciding 
whether the explanations provided by different models are competitive
or  complementary.  This  statement  is  true  no  matter  how  elegant  the 
formal  statements  and  tests  of  different  models  may  be. 

Modeling  can  reach  an  impasse  despite  the  serious  pursuit  of  the 
comparison,  contrast,  and  integration  of  different  paradigms.  The 
integration  may  produce  an  overarching  paradigm  that  includes  the  best 
elements  of  all  available  alternatives.  This  super  paradigm  still  may 
not  adequately  account  for  behavior  of  interest.  The  principled 
argument  process  can  grind  to  a  halt  if  there  is  no  method  of 
introducing  new  perspectives.  Qualitative  research  methods  and 
exploratory  data  analysis  (EDA)  are  tools  for  identifying  new 
perspectives.

Qualitative  research  and  EDA  have  a  common  core.  Both  approaches 
search for patterns in data. This common element introduces a
potential problem. Human beings are very good at perceiving patterns
(Gould,  2002).  The  perceived  patterns  are  translated  easily  into 
plausible  stories  of  causal  events.  However,  those  stories  may 
exclude  key  facts  to  conform  to  an  iconic  form  (Gould,  2002;  Miles  & 
Huberman,  1994).  Thus,  human  interpretive  tendencies  can  work 
against  the  search  for  a  better  understanding  of  behavior.  The  search 
for  patterns  must  include  mechanisms  to  protect  against  this 
possibility.  The  need  for  an  open  mind  is  one  underlying  theme  of  the 
discussion  of  methods  of  searching  for  new  perspectives  that  follows 
below.  The  value  of  checks  and  balances  in  the  search  is  another 
theme.  Properly  combined,  these  elements  make  qualitative  research 
and  EDA  constructive  tools  for  exploring  blind  spots  that  limit  the 
value of behavioral models.

Qualitative  Research 

Statistical  models  are  mathematical  abstractions  that  frequently 
are interpreted as descriptions of causal processes. The formal
statements  of  these  models  appear  to  be  definitive.  A  neat  set  of 
equations  with  specific  parameter  values  replaces  the  original  data. 
This  form  of  presentation  makes  it  easy  to  forget  that  the  parameter 
values  are  only  sample  estimates,  that  all  of  the  equations  include  an 
error  component,  and  that  latent  variables  are  involved.  The  risk  of 
producing  nonsense  is  substantial  if  statistical  models  are  not 
subjected  to  serious  tests  based  on  other  methods. 

General  Approach 

Qualitative  research  provides  methods  that  can  be  used  to 
generate  initial  models  or  to  subject  existing  models  to  additional 
testing.  Qualitative  research  covers  a  wide  range  of  activities 
(Denzin  &  Lincoln,  1994).  The  focus  of  these  activities  is  the 
identification  of  patterns  in  a  set  of  observations.  The  observations 
may  be  recorded  in  notes  made  by  an  observer,  in  written  material 
produced  by  the  subjects  being  studied,  or  in  other  forms.  Matrices 
and  graphs  are  among  the  tools  commonly  used  to  identify  patterns 
(Miles & Huberman, 1994).

Qualitative  methods  require  a  suitable  database.  Observations 
must  be  made  and  entered  into  databases,  usually  as  text.  The  text 
must  be  annotated  with  observer  judgments  to  identify  critical  points 
and  link  them  to  specific  sections  of  material.  The  coding  process 
may  identify  ambiguous  code  categories,  important  events  that  do  not 
fit  within  the  coding  scheme,  or  other  anomalies.  In  such  cases,  the 
investigator  must  amend  the  existing  process  and  review  the  material 
again.  Once  the  coding  process  is  complete,  the  investigator  must 
search  through  a  large  volume  of  material  to  identify  specific 
instances  of  hypothesized  associations.  The  data  must  then  be 
abstracted  to  identify  patterns  that  can  be  used  to  organize  the 
findings  (Miles  &  Huberman,  1994).  After  the  pattern  has  been 
established,  the  data  may  be  reviewed  yet  again  to  determine  whether 
anomalies represent coding errors. The investigator then may review
the material still one more time to ensure that all events that fit
within  the  coding  scheme  have  been  identified.  Finally,  the  search 
might  be  followed  by  additional  searches  to  test  the  internal  logic  of 
the  existing  coding  scheme,  evaluate  tentative  inferences  drawn  from 
that  scheme,  and  identify  competing  explanations  (Miles  &  Huberman, 
1994). This general procedure has been facilitated in recent years by
the development of a number of computer programs to aid in the process
(cf. Dohan & Sanchez-Janowski, 1998; Miles & Huberman, 1994). As
yet, there is no single best program or "killer application" (Dohan &
Sanchez-Janowski, 1998). The basic methods of qualitative analysis
have  made  it  difficult  to  increase  sample  sizes  in  the  past. 

Search  Methods 

The  search  for  patterns  in  qualitative  data  matrices  can  involve 
a  variety  of  heuristics.  Miles  and  Huberman  (1994,  pp.  245-277)  draw 
a  distinction  between  strategies  that  generate  meaning  and  strategies 
that  test  or  confirm  findings.  Tactics  for  generating  meaning  include 
(1)  noting  patterns  and  themes,  (2)  seeing  plausibility,  (3) 
clustering,  (4)  making  metaphors,  (5)  counting  exemplars,  (6)  making 
contrasts  and  comparisons,  (7)  partitioning  variables,  (8)  subsuming 
particulars  into  general  categories,  (9)  factoring,  (10)  noting 
(qualitative)  relations  between  variables,  and  (11)  finding 
intervening  variables.  The  products  of  these  tactics  then  are  used  to 
build  a  logical  chain  of  evidence  and  to  make  conceptual  or 
theoretical  sense  of  the  data. 

Qualitative  research  is  sensitive  to  the  potential  for  biases 
such  as  perceiving  events  as  more  patterned  than  they  actually  are 
(Miles & Huberman, 1994, p. 263). Good qualitative research includes
checks  to  reduce  the  risk  of  bias.  Tests  include  checks  for  (1)  data 
representativeness,  (2)  researcher  effects,  (3)  methods  effects,  and 
(4)  data  weighting  effects.  Tactics  for  detecting  points  where  the 
pattern  breaks  down  include  (5)  searching  for  outliers,  (6)  examining 
extreme  cases  within  the  pattern,  (7)  reviewing  surprising  events,  and 
(8)  searching  for  data  that  are  contrary  to  the  pattern.  Explanations 
are  tested  by  (9)  making  if-then  tests,  (10)  ruling  out  spurious 
relations,  (11)  replicating  key  findings,  (12)  checking  rival 
explanations,  and  (13)  getting  feedback  from  informants. 

Formal  Analysis 

The  preceding  lists  of  exploratory  and  confirmatory  tactics 
provide  a  rough  general  picture  of  the  qualitative  research  approach. 
The  underlying  logic  of  translating  observations  into  theoretical 
statements  is  similar  to  that  applied  in  quantitative  analysis.  This 
similarity  is  even  more  pronounced  when  qualitative  researchers 
undertake  formal  qualitative  analyses.  Formal  qualitative  analysis 
techniques  do  not  yield  predictive  equations,  but  do  impose  specific 
restrictions  on  the  organization  and  interpretation  of  data  (Griffin  & 
Ragin, 1994).

Formal qualitative analysis techniques commonly focus on
categorical  variables.  Models  are  constructed  to  explain  why  cases 
fall  into  particular  categories  for  one  of  the  variables.  The 
explanatory  variables  in  a  study  define  a  large  matrix  in  which  each 
cell  represents  a  different  combination  of  categories.  When  all  the 
cases  in  a  cell  come  from  a  single  category  of  the  criterion  variable, 
the combination of attributes defining the cell comprises a set of
conditions  that  are  sufficient  to  produce  a  case.  The  simplest 
explanatory  model  results  when  two  conditions  are  met.  First,  all  of 
the  cases  in  each  criterion  category  fall  in  a  single  explanatory 
cell.  Second,  the  explanatory  cell  is  different  for  each  criterion 
category.  When  cases  from  a  single  criterion  category  are  found  in 
more  than  one  cell,  more  than  one  causal  process  may  precede  the  same 
end  state.  A  method  such  as  qualitative  comparative  analysis  (QCA; 
cf. Ragin, 1987) can be used to formally determine which explanatory
variables  actually  are  needed  to  account  for  membership  in  each 
criterion  category. 
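
The truth-table step described above can be sketched in a few lines. The example is illustrative only: the cases, the variable names A, B, and C, and the helper functions are hypothetical, and the sketch stops short of the Boolean minimization that a full QCA would perform.

```python
from collections import defaultdict

# Hypothetical cases: (profile of explanatory variables, criterion category).
cases = [
    ({"A": 1, "B": 1, "C": 0}, "success"),
    ({"A": 1, "B": 1, "C": 0}, "success"),
    ({"A": 0, "B": 1, "C": 1}, "success"),
    ({"A": 0, "B": 0, "C": 1}, "failure"),
    ({"A": 1, "B": 0, "C": 0}, "failure"),
    ({"A": 1, "B": 0, "C": 1}, "success"),
    ({"A": 1, "B": 0, "C": 1}, "failure"),
]


def truth_table(cases, variables):
    """Map each cell (combination of categories) to the set of outcomes observed in it."""
    table = defaultdict(set)
    for profile, outcome in cases:
        cell = tuple(profile[v] for v in variables)
        table[cell].add(outcome)
    return dict(table)


def sufficient_cells(table):
    """Keep cells whose cases all fall in a single criterion category."""
    return {cell: next(iter(outs)) for cell, outs in table.items() if len(outs) == 1}


def contradictory_cells(table):
    """Cells containing cases from more than one criterion category."""
    return [cell for cell, outs in table.items() if len(outs) > 1]


table = truth_table(cases, ["A", "B", "C"])
sufficient = sufficient_cells(table)
contradictions = contradictory_cells(table)
print(sufficient)
print(contradictions)
```

In this toy data set, "success" cases occupy two different cells, which is the situation in which more than one causal process may precede the same end state. The mixed cell would prompt a review of the coding or a search for additional predictors.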

Qualitative  Causal  Models 

The  models  generated  by  qualitative  research  differ  from 
statistical  models  in  two  important  respects.  Qualitative  research 
models  are  based  on  formal  logic.  Causal  models  are  formulated  in 
terms  of  necessary  and  sufficient  conditions.  All  cases  demonstrating 
a  specific  profile  of  explanatory  variables  are  expected  to  be  members 
of  the  same  outcome  category.  If  an  observation  with  a  particular 
explanatory  profile  is  not  a  member  of  the  predicted  outcome  category, 
the  data  are  reviewed  to  identify  errors  in  coding.  If  the  coding  is 
correct,  a  search  for  additional  predictors  may  be  initiated.  This 
approach  contrasts  with  statistical  models  such  as  discriminant 
function or loglinear analyses. Those methods would estimate a set of
probabilities  representing  the  likelihood  that  the  case  should  be 
classified  as  a  member  of  each  outcome  category.  The  case  then  would 
be  assigned  to  the  category  with  the  highest  probability.  Thus, 
qualitative  analyses  strive  for  a  definite  assignment  of  each  case 
while  quantitative  analyses  assign  cases  to  categories  based  on 
probabilities.  In  some  cases,  statistical  models  can  produce  roughly 
equal  probabilities  for  membership  in  two  or  more  categories.  The 
associated uncertainty is one difference between the two approaches.

The  explanatory  significance  of  predictor  variables  also  differs 
between  qualitative  and  quantitative  analysis  models.  Statistical 
models  focus  primarily  on  additive  effects  of  predictors.  Most  models 
therefore  consist  of  linear  weighted  sums  of  the  predictors.  For  each 
observation,  the  probability  of  category  membership  is  increased  or 
decreased  to  some  extent  based  on  the  predictor  score.  The 
probability  estimate  is  modified  regardless  of  the  values  of  other 
predictors.  In  the  qualitative  approach,  none  of  the  predictors  has 
an isolated effect. The import of each predictor is contingent on the
values  of  other  predictors  because  a  case  is  assigned  to  a  particular 
category  only  when  the  overall  profile  of  explanatory  variables 
justifies that classification. The contingent nature of the
relationship between explanatory variables and category membership
would  imply  an  interaction  in  a  statistical  model.  A  qualitative  model 
involving  several  predictors,  therefore,  might  be  equivalent  to  a 
statistical  model  involving  higher  order  interactions.  A  qualitative 
model  with  even  a  modest  number  of  predictors  therefore  implies  a 
level  of  complexity  of  interplay  among  predictors  seldom  found  in 
statistical  models.  From  the  qualitative  perspective,  the  complexity 
is  justified  by  the  assumption  that  causal  processes  that  determine 
category  membership  represent  the  interplay  of  a  number  of  factors.  A 
theoretical  account  of  the  evidence  must  spell  out  the  contingencies 
in  this  interplay.  Interpretations  of  statistical  models  are  less 
likely  to  assume  that  the  set  of  predictors  in  the  model  define  an 
integrated  causal  process.  Instead,  those  models  are  likely  to 
interpret  category  membership  as  the  product  of  the  independent 
operation  of  a  number  of  independent  causal  processes.  Qualitative 
analyses,  therefore,  may  be  especially  useful  in  stimulating  thought 
about the interplay of causal variables.
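
The contrast between contingent and additive effects can be made concrete with a two-predictor sketch. Both rules below are hypothetical: the conjunctive rule stands in for a qualitative model in which the outcome requires both conditions, and the additive rule uses arbitrary illustrative weights. The interaction contrast asks whether the effect of A changes with the value of B.

```python
def conjunctive_rule(a, b):
    """Qualitative-style rule: the outcome occurs only when A and B are both present."""
    return float(a and b)


def additive_rule(a, b):
    """Additive rule with hypothetical weights: each predictor shifts the score on its own."""
    return 0.2 + 0.3 * a + 0.4 * b


def interaction_contrast(rule):
    """Effect of A when B is present minus the effect of A when B is absent."""
    effect_of_a_with_b = rule(1, 1) - rule(0, 1)
    effect_of_a_without_b = rule(1, 0) - rule(0, 0)
    return effect_of_a_with_b - effect_of_a_without_b


conjunctive_contrast = interaction_contrast(conjunctive_rule)
additive_contrast = interaction_contrast(additive_rule)
print(conjunctive_contrast, additive_contrast)
```

The additive rule yields a contrast of zero apart from floating-point rounding, whereas the conjunctive rule yields a nonzero contrast. A statistical model would need an A-by-B interaction term to reproduce the conjunctive pattern, and each added predictor in a conjunctive rule raises the order of interaction required.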

Qualitative  Research  and  MAGIC 

Qualitative research emphasizes two elements of the MAGIC model
that are likely to receive less attention in quantitative studies:
articulation and interestingness. The qualitative approach emphasizes
the articulation of causal processes and their links to actual behavior
and events. The qualitative approach also links models to real entities
and events. This linkage is likely to make the results more interesting
to consumers of the model.

Statistical models often embody very sketchy causal assertions.
A sketchy description of a plausible rationale is given and an
appropriate  arrow  is  inserted  into  a  causal  graph.  One  set  of  causal 
arrows  is  preferred  if  it  reproduces  aggregated  observations  better 
than  another  set.  This  avenue  of  study  can  be  pursued  without  ever 
subjecting  the  initial  causal  assertions  to  close  scrutiny.  For 
example,  it  may  never  be  determined  whether  the  assertions  are 
plausible  for  a  single  case  considered  in  isolation. 

By  contrast,  qualitative  analysis  can  subject  causal  assertions 
to  closer  scrutiny.  Abstract  traits  are  replaced  with  specific  events 
that  often  can  be  located  in  temporal  sequences.  Serious 
consideration  may  be  given  to  alternative  causal  paths  without  the 
necessity  of  choosing  a  single  alternative.  For  example,  if  QCA 
produces more than one cell of "cases," the result implies the
existence  of  alternative  causal  models.  These  models  might  be 
represented  in  an  SEM  as  different  pathways  once  identified,  but  the 
key  problem  of  identifying  alternative  causal  patterns  would  be  more 
difficult  to  solve  in  the  usual  quantitative  analysis.  Insensitivity 
to  the  existence  of  alternative  causal  models  is  a  weakness  of  current 
practice  in  statistical  modeling. 

Qualitative  research  can  increase  the  interest  value  of  models. 
Statistical models in the behavioral sciences are of interest
primarily to narrow research communities. Economics models are an
obvious  exception  to  this  statement.  In  other  areas,  the  linkage 
between  latent  traits  and  specific  types  of  behavior  may  be  too  vague 
to  interest  potential  consumers  (e.g.,  policy-makers,  clinicians, 
managers,  military  leaders).  Abstract  variables  are  of  interest  to 
these  audiences  only  when  they  map  onto  the  decision  terrain  faced  by 
the  user.  Examples  from  qualitative  research  could  help  to  define 
this  relationship  more  clearly. 

Exploratory  Data  Analysis  (EDA) 

EDA  (Tukey,  1977)  shares  a  core  element  with  qualitative 
research.  Both  approaches  are  concerned  with  exploiting  the  richness 
of  the  data.  This  concern  drives  the  view  that  models  should  reflect 
observations  made  by  the  investigator  after  a  period  of  intensive 
interaction  with  the  objects  of  study.  Both  approaches  share  a 
concern  that  routine  application  of  statistical  computer  algorithms 
will  obscure  important  aspects  of  the  data.  Thus,  both  EDA  and 
qualitative  research  emphasize  cyclical  evaluation  of  models.  Each 
cycle  involves  a  sequence  of  identifying  patterns  in  the  data, 
developing  hypotheses  to  account  for  those  patterns,  followed  by 
testing  and  revising  the  hypotheses.  The  revised  hypotheses  then  are 
the  basis  for  the  next  cycle.  The  cycle  is  repeated  until  an 
acceptable  representation  of  the  data  is  obtained.  EDA  and 
qualitative  research  differ  in  that  the  former  typically  applies  the 
observe-test-revise-test-repeat  cycle  to  quantitative  data  rather  than 
nominal  categorical  data. 

A  typical  EDA  sequence  might  be  as  follows.  A  series  of  graphic 
displays  is  examined  to  identify  general  patterns  in  the  data.  An 
initial  mathematical  model  is  formulated  as  a  first  attempt  to  capture 
the  pattern.  The  data  are  analyzed  to  estimate  parameter  values  for 
the  initial  model  and  to  compute  differences  between  the  predicted  and 
observed  values.  A  second  round  of  graphic  displays  examines  the 
residuals  from  the  initial  model  to  identify  areas  of  misfit  between 
the  model  and  the  data.  A  revised  model  is  formulated  to  account  for 
the  residuals  and  is  fitted  to  the  data.  The  cycle  is  repeated  until 
a  good  representation  of  the  data  can  be  achieved. 
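
A minimal sketch of two passes through this cycle follows. The data are hypothetical and noise-free so that the pattern in the residuals is easy to see, and inspection of printed residuals stands in for the graphic displays. In the second pass, a squared term is added because it predicts the residuals left by the straight-line model.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(x, y, intercept, slope):
    """Observed minus predicted values."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def sum_sq(values):
    """Sum of squared values, used here as a crude index of misfit."""
    return sum(v * v for v in values)


# Hypothetical curved data: y = 1 + 2x + 0.5x^2.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [1 + 2 * xi + 0.5 * xi ** 2 for xi in x]

# Cycle 1: fit a straight line and examine the residuals.
b0, b1 = fit_line(x, y)
r1 = residuals(x, y, b0, b1)
print(r1)  # the U-shaped pattern signals curvature the line has missed

# Cycle 2: revise the model by checking whether x^2 predicts the residuals.
x_squared = [xi ** 2 for xi in x]
c0, c1 = fit_line(x_squared, r1)
r2 = residuals(x_squared, r1, c0, c1)
print(sum_sq(r1), sum_sq(r2))
```

Fitting the new term to the residuals reproduces a joint fit here only because the x values are symmetric about zero. With real data, the revised model would normally be refitted as a whole.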

The  EDA  approach  is  more  a  frame  of  mind  than  a  unique  analytic 
method.  Any  of  the  steps  described  in  the  preceding  paragraph  could 
be  included  in  a  standard  statistical  analysis.  Behrens  (1997) 
summarizes  the  key  elements  of  the  EDA  frame  of  mind  as: 

•  Understand the context well enough to make informed
decisions given theory and prior research findings (p. 135).

•  Use graphic representations of the data to guide analysis
decisions by looking at the actual pattern of data (p. 135).

•  Develop models iteratively from tentative model
specification followed by residuals assessment (p. 139).

•  Use robust/resistant procedures to minimize the influence
of distributional assumptions (p. 143).

•  Attend to outliers not merely as indications of problems in
the research process, but as potential indicators of
anomalous phenomena that require explanation (p. 144).

•  Re-express the original scales when doing so will
facilitate interpretation, promote symmetry, stabilize the
spread of values within groups in the analysis, or promote
straight line relationships (p. 145).

Behrens  (1997)  provides  greater  detail  on  the  preceding  points  with 
references  to  original  sources  that  explore  these  various  issues  in 
depth.  A  full  treatment  of  these  methods  is  not  possible  here,  but 
Behrens's  (1997)  general  guidelines  are  directly  related  to  issues 
discussed  earlier  in  this  chapter.  Understanding  context  is  related 
to  the  idea  that  one  should  not  suspend  judgment  when  analyzing  data. 
When  prior  findings  contribute  to  context,  these  admonitions  share  the 
spirit  of  strong  significance  tests  because  prior  findings  replace  the 
null  hypothesis  as  a  research  field  matures.  The  admonition  to 
develop  models  iteratively  is  implicitly  related  to  parsimony  because 
it  is  based  on  fitting  a  simple  model  to  data.  New  parameters  are 
added  only  if  they  predict  the  residuals.  Stepwise  regression,  EFA, 
and  other  analyses  begin  with  simple  models  then  extend  them  if  adding 
more  predictors  or  more  factors  will  provide  a  better  account  of  the 
variance.  These  procedures  provide  iterative  models,  but  the  methods 
are  constrained  by  statistical  criteria  (i.e.,  maximize  the  variance 
explained)  rather  than  theoretical  or  empirical  context  and  the 
judgment of the researcher. The emphases on robust procedures and
outliers direct attention to the need to develop models that
accurately  predict  behavior  in  most  of  the  people  most  of  the  time. 
Both  of  these  elements  of  EDA  could  be  useful  to  identify  exceptional 
groups  of  observations.  If  there  are  no  obvious  errors  in  the  data, 
these  groups  might  become  the  basis  for  hypotheses  that  could  be 
tested  later  using  taxometric  methods.  Finally,  the  emphasis  on 
interpretation  is  a  reminder  that  statistical  summaries  are  not  the 
endpoint  for  analysis.  The  data  must  be  interpreted  in  ways  that  link 
them  to  actual  behavior  and  to  theory. 
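
The role of resistant summaries and outlier screening can be illustrated with a small sketch. The scores are hypothetical. The median and the median absolute deviation (MAD) are used as resistant estimates of center and spread; the scale factor 1.4826 makes the MAD comparable to a standard deviation under normality, and the cutoff of 3 is a conventional choice assumed here rather than a value taken from Behrens (1997).

```python
from statistics import mean, median


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def flag_outliers(values, cutoff=3.0):
    """Return values lying more than `cutoff` robust standard units from the median."""
    center = median(values)
    scale = 1.4826 * mad(values)
    if scale == 0:
        return []  # no spread around the median, so nothing can be flagged
    return [v for v in values if abs(v - center) / scale > cutoff]


# Hypothetical scores with one extreme observation.
scores = [2, 3, 3, 4, 4, 5, 60]

print(mean(scores))    # pulled far upward by the extreme score
print(median(scores))  # resistant to the extreme score
flagged = flag_outliers(scores)
print(flagged)
```

The flagged observation is a candidate for explanation, not automatic deletion: if no recording error is found, it may mark an exceptional group of the kind discussed above.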

State  of  the  Art 

Several  factors  make  it  likely  that  there  will  be  dramatic 
improvements  in  behavioral  models  in  the  near  future.  First,  the 
development  of  computer  hardware  and  software  has  reduced  barriers  to 
incorporating  advanced  procedures  into  research.  Today's  programs 
routinely  include  simple  methods  of  specifying  models  for  analysis. 
Examples are drop-down menus and graphic interfaces. Database
translation  programs  make  it  possible  to  format  data  in  almost  any 
familiar  form  and  import  it  into  a  new  program.  Thus,  it  is  no  longer 
necessary to master a complex computer syntax that is specific to a
particular computer program with a limited range of analytic
functions.  For  a  modest  cost,  every  researcher  can  have  ready  access 
to  each  type  of  analysis  discussed  in  this  chapter.  In  fact, 
researchers  in  large  organizations  are  likely  to  have  access  to 
several  different  programs  that  implement  the  most  widely  used 
methods.

A  reduced  risk  of  "garbage  in,  garbage  out"  analysis  is  a  second 
positive  factor.  The  increased  integration  of  different  analysis 
procedures  under  the  heading  of  GLM  or  EM  maximum  likelihood  methods 
makes  it  clear  that  different  models  are  manifestations  of  general 
principles.  Long  (1997)  provides  a  fundamental  expression  of  this 
point  through  his  observation  that  linking  functions  translate 
different  CLDV  analyses  into  familiar  linear  regression  models.  Long 
further  notes  that  techniques  learned  in  the  more  familiar  linear 
regression  context  apply  to  procedures  such  as  logit,  probit,  and 
logistic  regression  analyses.  Analogous  situations  can  be  identified 
in  SEM  and  other  procedures.  Recognizing  and  capitalizing  on  these 
similarities  reduces  the  learning  curve  required  for  effective  use  of 
new procedures.
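
The role of the linking function can be stated compactly. The notation below is the conventional generalized linear model form and is not quoted from Long (1997): a function g connects the expected outcome to a familiar linear predictor, and the logit and probit models differ only in the choice of g.

```latex
g\bigl(E[y \mid x]\bigr) = \beta_0 + \sum_{j=1}^{k} \beta_j x_j,
\qquad
\text{logit: } g(p) = \ln\frac{p}{1-p},
\qquad
\text{probit: } g(p) = \Phi^{-1}(p)
```

Here p is the probability of the outcome and \Phi is the standard normal cumulative distribution function. The right-hand side is the same weighted sum of predictors used in ordinary linear regression, which is why skills learned in that setting carry over.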

There is movement toward confirmatory methods. Confirmatory
methods  encourage  explicit  theoretical  statements  by  making  it 
possible  to  impose  constraints  on  a  model.  This  aspect  of  analysis  is 
not  new.  For  example,  many  researchers  have  conducted  analyses  that 
forced  the  entry  of  predictors  into  a  regression  equation.  However, 
the  range  of  analyses  that  can  be  conducted  with  constraints  that 
specify  precise  values  for  model  parameters  now  extends  to  factor 
analysis (i.e., CFA), cluster analysis (i.e., EM mixture analysis),
and  substantive  models  (e.g.,  SEM).  Meta-analysis  provides  tools  for 
developing  parameter  estimates  based  on  the  cumulative  research 
record.  The  process  of  thinking  through  the  potential  constraints 
encourages  better  articulation  of  the  relationship  between  the  model 
and  theory.  At  the  same  time,  a  good  fit  between  a  highly  constrained 
model  and  the  data  provides  the  convergence  between  predictions  and 
evidence  that  indicates  a  strong  theory. 

The  identification  of  important  blind  spots  in  traditional 
research  practices  is  another  positive  development.  The  most  critical 
blind  spot  is  the  tendency  toward  confirmation  bias.  The  demonstrable 
existence  of  equivalent  models  is  the  strongest  argument  for  giving 
careful  attention  to  this  problem.  However,  acknowledging 
confirmation  bias  also  directs  attention  to  other  important  problems. 
Parsimony,  an  accepted  desideratum  for  sound  theories,  comes  into  view 
when  it  is  recognized  that  nearly  equivalent  models  may  exist  that 
involve  fewer  parameters.  Recognition  of  confirmation  bias  can  also 
lead  to  more  frequent  comparisons  of  models  based  on  different 
research paradigms (Katzko, 2002). One intriguing issue here is that
syndrome  models,  which  generate  an  overall  model  by  simply  collecting 
and  enumerating  specific  paradigms,  may  prove  defensible  on  further 
analysis.  Conceivably,  this  approach  is  the  modeling  equivalent  of 
taxometric definitions of mental health syndromes. However, as Meehl
(1992) has pointed out, the existence of a typology is a testable
hypothesis,  not  something  to  be  established  by  fiat.  Where  multiple 
models  or  paradigms  are  not  currently  available,  tools  for  searching 
for  alternative  models  are  available  (e.g.,  qualitative  analysis,  EDA, 
TETRAD).  Spirtes,  Richardson,  Meek,  Scheines,  and  Glymour  (1998) 
argue  that  a  serious  search  for  alternative  models  should  be 
undertaken  prior  to  conducting  any  analysis. 

Recent  publication  guidelines  for  statistical  practices  should 
encourage  improved  modeling  practices.  The  availability  of  these 
guidelines  indicates  that  practice  has  matured  to  produce  at  least  a 
broad general consensus on methods. The consensus includes
recommendations  on  the  general  problems  of  statistical  inference 
(Wilkinson  &  the  Task  Force  on  Statistical  Inference,  1999)  and 
recommendations  for  standard  reporting  in  SEM  (Boomsma,  2000;  McDonald 
&  Ho,  2002).  The  SEM  guidelines  may  be  particularly  useful  for 
modeling.  The  general  steps  that  are  outlined  can  be  adapted  to 
almost  any  analysis,  particularly  those  involving  the  imposition  of 
constraints  on  model  parameters  and  structure.  Diagnostic  tools  are 
available  at  various  steps  in  the  process  to  assess  potential 
weaknesses  of  existing  models.  These  tools  include  methods  of 
identifying  outlier/influential  data  points  and  searching  for 
alternative  models.  Systematic  application  of  these  tools  will  reduce 
the  likelihood  that  models  will  be  affected  by  blind  spots  in  the 
conceptual  model  under  investigation  or  quirky  elements  of  the  data 
being analyzed. Applications that embody the test-and-revise spirit
of  EDA  are  likely  to  be  particularly  fruitful. 

The  state  of  the  art  is  itself  an  example  of  principled  argument. 
Progress  has  been  made  in  some  areas,  but  consensus  has  not  been 
reached  on  all  aspects  of  modeling.  Methods  of  appraising  and 
amending  models  are  in  a  state  of  flux.  Significance  testing  is 
becoming  less  important  in  SEM,  but  it  continues  to  be  the  primary 
tool  for  model  assessment  in  other  areas  (e.g.,  CLDV  analyses).  Even 
in  the  SEM  domain,  consensus  is  only  qualitative  in  some  areas.  For 
example,  no  consensus  has  been  reached  regarding  the  best  GFI  to  use. 
Hu  and  Bentler's  (1998,  1999)  two-indicator  approach  probably 
approximates  the  current  consensus  in  this  area  with  SRMR  and  RMSEA  as 
a  workable  combination.  These  indices  reflect  the  misfit  and  PRE  of 
the  model,  respectively,  and  appear  to  be  sensitive  to  mistakes  in 
both the measurement model and path model. Uncertainty in this area
illustrates  an  issue  that  is  likely  to  be  important  in  SEM  in  the  near 
future.  The  ongoing  controversy  over  significance  testing  directs 
attention  to  the  potential  use  of  multiple  criteria  in  other  areas. 
Given  multiple  criteria,  it  is  reasonable  to  expect  future  work  to 
address  the  problem  of  how  best  to  combine  alternative  criteria.  At 
present,  the  issue  of  selecting  and  weighting  indicators  of  model 
adequacy  is  a  judgment  call  for  the  researcher.  At  the  same  time, 
there  is  clear  evidence  of  movement  away  from  relying  on  significance 
testing  as  the  primary  method  of  model  evaluation. 
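
As a rough illustration of the two-index approach, the sketch below computes an RMSEA point estimate from a model chi-square and then checks SRMR and RMSEA against cutoffs close to those discussed by Hu and Bentler (1999), roughly .08 for SRMR and .06 for RMSEA. The function names, the example values, and the use of N - 1 in the denominator are assumptions made for illustration. Some programs use N instead, and the cutoffs are guidelines rather than fixed rules.

```python
import math


def rmsea(chi_square, df, n):
    """Point estimate of RMSEA from a model chi-square, its degrees of
    freedom, and the sample size (N - 1 convention; an assumption)."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    # Misfit beyond what is expected by chance, per degree of freedom.
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


def two_index_check(srmr, rmsea_value, srmr_cutoff=0.08, rmsea_cutoff=0.06):
    """Return True only when both indices fall at or below their cutoffs."""
    return srmr <= srmr_cutoff and rmsea_value <= rmsea_cutoff


# Hypothetical values, not taken from any study.
example_rmsea = rmsea(chi_square=85.0, df=40, n=301)
acceptable = two_index_check(srmr=0.05, rmsea_value=example_rmsea)
```

With these hypothetical values the RMSEA falls just above the .06 guideline while the SRMR is well below .08, so the combined check fails. Cases of this kind are where the weighting of indicators becomes a judgment call for the researcher.
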
It is not clear at this time whether the PRE approach should be
extended  to  all  types  of  models.  For  example,  should  this  criterion 
be  applied  in  the  study  of  CLDV?  The  research  tradition  in  areas  of 
study  using  these  tools  has  emphasized  significance  testing  rather 
than incremental fit as the primary basis for choosing between models.
Procedures such as Kaplan's (1990a, 1990b) combination of modification
indices  and  expected  parameter  change  provide  an  alternative 
perspective  on  post  hoc  model  modification.  However,  investigators 
must  be  sensitive  to  the  risk  of  producing  a  complex  model  that  merely 
capitalizes on chance (Green et al., 2001; Steiger, 1990). Responses
to  Kaplan’s  (1990a)  suggestions  included  the  recommendation  that 
modifications  should  not  be  introduced  without  adequate  theoretical 
justification (Bollen, 1990). Kaplan (1990b) concurred with this
recommendation,  but  even  this  criterion  may  be  inadequate.  Steiger 
(1990)  posed  the  question,  "What  percentage  of  researchers  would  ever 
find  themselves  unable  to  think  up  a  theoretical  justification  for 
freeing  a  parameter?"  (p.  175,  italics  in  the  original).  Note  that 
this  question  was  posed  in  the  context  of  post  hoc  modifications 
rather  than  a  priori  specification.  Even  with  such  justification, 
modifications  should  be  examined  in  a  new  sample  of  data  to  verify 
that  they  are  productive.  In  connection  with  this  point,  Steiger 
(1990,  p.  176)  also  noted  emphatically,  "An  ounce  of  replication  is 
worth  a  ton  of  inferential  statistics"  (italics  in  the  original). 
Increasing use of replication will likely be a trend in the future.
One reason is that increased use of bootstrapping and other resampling
methods provides ways of pursuing this end without radically
increasing the volume of data needed in the modeling process (Wilcox,
1998).
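
A minimal sketch of the resampling idea follows, assuming a percentile bootstrap for a correlation coefficient. The data are simulated stand-ins for a real sample, and the function names, the number of resamples, and the choice of the percentile method are illustrative assumptions rather than a procedure taken from Wilcox (1998).

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation computed from two paired lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation of xs and ys."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        # Resample whole cases with replacement so each x stays with its y.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson_r([xs[i] for i in idx],
                                   [ys[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Simulated data standing in for a real sample (hypothetical).
data_rng = random.Random(42)
x = [data_rng.gauss(0, 1) for _ in range(60)]
y = [0.5 * xi + data_rng.gauss(0, 1) for xi in x]
r_hat = pearson_r(x, y)
ci_low, ci_high = bootstrap_ci(x, y)
```

The width of the interval gives a sense of how much the estimate might move in a new sample of the same size. It does not replace replication in independent data.
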


The  development  of  a  broader  perspective  on  research  programs  may 
become  a  growth  area  in  the  future.  Qualitative  research  and  EDA  have 
been  examined  here  as  potential  methods  of  avoiding  confirmation  bias. 
Qualitative  analysis  can  be  an  end  in  its  own  right,  but  this  general 
approach  also  has  the  potential  to  stimulate  the  formulation  of  new 
models.  QCA  is  interesting  as  a  means  of  identifying  causally 
relevant  variables  that  could  be  incorporated  into  models.  QCA  also 
is  a  stimulus  to  rethinking  a  problem  because  it  embodies  a  different 
concept  of  causation  than  is  found  in  SEM,  for  example.  EDA  and 
TETRAD  provide  additional  tools  for  using  data  to  generate  causal 
models.  The  use  of  these  tools  is  important  as  an  antidote  to  the 
confirmation  bias  that  occurs  when  a  moderately  good  fit  to  the  data 
is  interpreted  as  sufficient  justification  to  accept  a  model  specified 
at  the  outset.  Perhaps  a  careful  wedding  of  qualitative  analysis  to 
quantitative analyses would help overcome the resistance to
innovative research in some domains (e.g., psychology journals; Kidd,
2002).  Explication  of  the  limitations  of  standard  statistical 
procedures  as  model  generating  tools  coupled  with  careful 
demonstration  of  the  checks  and  balances  involved  in  proper  causal 
inference  from  qualitative  data  could  be  critical  to  making  a  better 
case  for  combining  the  two  approaches  when  constructing  models.  The 
development  of  methods  of  determining  whether  a  qualitative  model  is 
superior  to  a  quantitative  model  in  some  situations  may  be  an 
important topic for future work. For example, a qualitative model
might identify a change of circumstances in an operational setting
that signals the need to switch causal models. A model appropriate to
the  new  circumstances  would  replace  the  quantitative  causal  model 
currently  in  use.  The  rapid  growth  of  techniques  for  data  mining  from 
documents  and  other  sources  could  facilitate  the  application  of 
qualitative  analysis  to  this  type  of  problem  in  the  near  future. 

This  description  of  the  state  of  the  art  is  certainly  incomplete. 
Attention  has  been  limited  largely  to  tools  routinely  used  in 
psychology  and  sociology.  Developments  in  other  domains  may  provide 
tools  that  can  be  very  useful.  For  example,  time-series  analysis  was 
developed  primarily  in  the  realm  of  econometrics.  The  absence  of  any 
consideration  of  how  to  translate  statistical  models  into  dynamic 
simulations  for  forecasting  is  another  potential  problem.  At  a 
minimum,  this  translation  will  have  to  include  distributions  of 
parameter  values  in  addition  to  functional  relationships  between 
variables.  Distributions  must  be  considered  to  generate  samples  of 
entities (i.e., individuals, groups) that will perform the actions
that  lead  to  the  forecast.  Techniques  such  as  HLM  are  useful  in  this 
regard  because  they  provide  estimates  of  variances.  However,  the 
actual  process  of  translating  those  estimates  into  simulations  may  be 
more  difficult  than  it  first  appears. 
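
The translation step can be sketched as follows, under strong simplifying assumptions: a single predictor, normally distributed and uncorrelated random intercepts and slopes, and variance estimates of the kind a multilevel program reports. All numeric values and function names are hypothetical. A fitted model may also report a covariance between intercepts and slopes, which this sketch ignores.

```python
import random
import statistics


def simulate_entities(n_entities, mean_intercept, mean_slope,
                      var_intercept, var_slope, seed=0):
    """Draw entity-specific intercepts and slopes from normal distributions.

    Assumes uncorrelated random effects (a simplification).
    """
    rng = random.Random(seed)
    sd_i = var_intercept ** 0.5
    sd_s = var_slope ** 0.5
    return [(rng.gauss(mean_intercept, sd_i), rng.gauss(mean_slope, sd_s))
            for _ in range(n_entities)]


def forecast(entities, x, resid_var, seed=1):
    """Simulate one outcome per entity at predictor value x."""
    rng = random.Random(seed)
    sd_e = resid_var ** 0.5
    return [b0 + b1 * x + rng.gauss(0, sd_e) for b0, b1 in entities]


# Hypothetical estimates, not taken from any fitted model.
entities = simulate_entities(5000, mean_intercept=10.0, mean_slope=2.0,
                             var_intercept=4.0, var_slope=0.25)
outcomes = forecast(entities, x=3.0, resid_var=1.0)
mean_outcome = statistics.fmean(outcomes)
var_outcome = statistics.variance(outcomes)
```

Under these assumptions the simulated outcomes at x = 3 should have a mean near 10 + 2(3) = 16 and a variance near 4 + 9(0.25) + 1 = 7.25. Comparing the simulated moments with such expected values is one simple check that the translation was carried out as intended.
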

Finally,  the  potential  for  different  categories  of  models  has 
been  referred  to  previously  only  in  passing.  Models  such  as  game 
theory  may  not  translate  readily  into  quantitative  terms.  The  problem 
of  how  to  construct  hybrid  models  that  integrate  quantitative  and 
qualitative  differences  in  dynamic  representations  may  be  a  major 
challenge  for  the  future. 

Computer  Programs  and  Specific  Implementations 

This  chapter  has  not  discussed  specific  software  programs  for 
implementing  state-of-the-art  analyses.  However,  almost  every 
procedure referred to in this chapter can be implemented using several
statistical  packages.  The  programs  often  reduce  the  problem  of 
specifying  a  model  to  simple  activities  such  as  drawing  a  picture  or 
filling  in  boxes  on  a  pop-up  computer  menu.  The  basic  problem  of  how 
to  implement  these  advanced  methods  therefore  reduces  to  choosing  an 
appropriate  program  and  specifying  the  model  of  interest.  The  full 
range  of  analysis  problems  that  can  be  addressed  by  these  means  cannot 
be  described  because  computer  programs  are  being  revised  and  updated 
so  rapidly.  Even  relatively  inexperienced  investigators  can  apply 
advanced  methods  effectively  when  guided  by  recent  recommendations 
regarding  statistical  practices  (e.g.,  Behrens,  1997;  Boomsma,  2000; 
McDonald  &  Ho,  2002;  Wilkinson  &  the  Task  Force  on  Statistical 
Inference,  1999). 

An  informal  survey  of  advanced  data  analysis  packages  suggests 
five  trends  in  the  development  of  computer  analysis  tools.  First, 
newer programs emphasize model testing and comparison. Program input
includes equations that define a model to be fitted to the data. The
program then estimates parameter values. Program output typically
includes an overall measure of fit between the model and the data
(e.g., a maximum likelihood χ²). Second, newer programs capitalize on
the fact that many nonlinear models can be transformed to linear
models (cf. Long, 1997). Thus, a single program fits models that use
appropriate expressions for continuous and discrete variables. Third,
programs  increasingly  accommodate  different  combinations  of  continuous 
and  discrete  variables.  Either  type  of  variable  may  appear  as  a 
predictor  or  a  dependent  variable  in  equation  form.  Even  latent 
variables  can  be  continuous  or  discrete  (Magidson  &  Vermunt,  2001, 
2002).  Fourth,  programs  are  more  likely  to  provide  simple  methods  of 
cross-validating  models.  Some  programs  provide  options  that 
automatically  divide  the  data  into  calibration  and  cross-validation 
samples.  Fifth,  the  range  of  graphic  display  methods  is  increasing. 
This  trend  makes  it  easier  to  apply  EDA  principles  during  data 
analysis.
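
The fourth trend, automatic division of the data into calibration and cross-validation samples, can be illustrated with a short sketch. It assumes a simple one-predictor regression and simulated data. The function names and the even split are illustrative choices, not the behavior of any particular package.

```python
import random


def split_sample(cases, calibration_fraction=0.5, seed=0):
    """Randomly divide cases into calibration and cross-validation samples."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * calibration_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(cases):
    """Ordinary least squares estimates for y = a + b * x from (x, y) pairs."""
    mx = sum(x for x, _ in cases) / len(cases)
    my = sum(y for _, y in cases) / len(cases)
    b = (sum((x - mx) * (y - my) for x, y in cases)
         / sum((x - mx) ** 2 for x, _ in cases))
    return my - b * mx, b


def r_squared(cases, a, b):
    """Proportion of variance in y accounted for by predictions a + b * x."""
    my = sum(y for _, y in cases) / len(cases)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in cases)
    ss_tot = sum((y - my) ** 2 for _, y in cases)
    return 1 - ss_res / ss_tot


# Simulated data standing in for a real sample (hypothetical).
data_rng = random.Random(7)
data = []
for _ in range(200):
    x_value = data_rng.gauss(0, 1)
    data.append((x_value, 1.0 + 0.8 * x_value + data_rng.gauss(0, 1)))

calibration, validation = split_sample(data)
a, b = fit_line(calibration)            # estimate on the calibration sample
r2_cal = r_squared(calibration, a, b)   # fit where the model was estimated
r2_val = r_squared(validation, a, b)    # fit in data the model has not seen
```

A marked drop from the calibration value to the cross-validation value suggests that the model capitalizes on chance features of the calibration sample.
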

These  trends  provide  better  tools  for  solving  the  difficult 
problem  of  moving  from  verbal  statements  to  mathematical  models 
evaluated  by  tests  of  fit  to  the  data  rather  than  by  null  hypothesis 
tests.  Wider  use  of  these  tools  will  surely  help  to  define 
consensible  facts  (Ziman,  1978)  by  clearly  articulating  the  links 
between  data  measurements  and  constructs  and  imposing  theoretically 
derived  values  on  the  data.  Stronger  consensus  should  also  be 
fostered  by  the  direct  comparison  of  alternative  models  conducted  in 
the  context  of  metatheoretical  criteria  for  model  choice  (e.g., 
parsimony).

Specific  programs  suitable  for  addressing  a  particular  problem 
can be identified in several ways. A review of the related research
literature can identify programs used in prior work. Specialized
journals (e.g., Structural Equation Modeling) often provide examples
of different programs. Methodological and statistical journals publish
reviews of books and computer programs; these reviews describe specific
programs and often contrast a given product with its competitors (e.g.,
Journal of the American Statistical Association, British Journal of
Mathematical Psychology, and Educational and Psychological
Measurement). Internet
searches  can  identify  programs  for  general  types  of  analysis.  For 
example,  at  the  time  of  this  writing,  a  search  for  "latent  class 
analysis"  identified  a  site  listing  15  computer  programs  that  would 
perform  this  procedure.  Similarly,  a  search  for  "cluster  analysis" 
identified  several  programs  that  implement  the  multivariate  mixture 
approach described by Fraley and Raftery (2002).

Guidance  on  specific  analysis  problems  is  available  in  many 
cases.  Program  documentation  now  routinely  supplements  written 
manuals with computerized tutorials and application examples.
Textbooks that describe the underlying statistical models, the
development  of  the  analysis  methods,  and  application  examples  are 
available  and,  in  some  cases,  specifically  linked  to  particular 
programs  (e.g.,  McCutcheon,  1987;  Raudenbush  &  Bryk,  2002;  Waller  & 
Meehl, 1998). Texts on general topics such as SEM are widely
available, but care is needed when choosing a text to ensure that it
covers critical issues (Steiger, 2001). Journal articles often
include  appendices  giving  the  command  syntax  for  specific  methods  or 
models.  Such  appendices  are  common,  for  instance,  in  Structural 
Equation  Modeling  and  Psychological  Methods  articles.  Internet  sites 
for  user  groups  include  bulletin  boards  for  seeking  expert  advice  on 
specific problems (e.g., SEMNET). These resources are helping to
break  down  barriers  to  the  use  of  modern  analysis  procedures,  thereby 
providing  tools  for  more  focused  tests  of  hypotheses. 

The  range  of  analytical  options  is  daunting  and  may  even  be 
intimidating.  Most  people  will  experience  a  natural  tendency  to  cling 
to  familiar  methods  that  provide  reasonably  sound  answers  to  their 
questions  and  permit  work  to  move  forward  in  an  orderly  fashion.  This 
reaction  should  be  tempered  by  the  fact  that  the  temptation  is  shared 
with  most  other  colleagues.  At  present,  the  average  researcher  is  not 
fully  prepared  to  exploit  all  analytical  opportunities  (Tinsley  & 
Brown, 2000b), but this situation is neither new nor an insurmountable
impediment  to  progress.  As  Berk  (1997,  p.  xxi)  notes  in  his 
introduction  to  Long's  (1997)  description  of  CLDV,  "For  most  of  the 
procedures  discussed...there  exist  statistical  routines  in  all  of  the 
major  statistical  packages.  This  is  both  a  blessing  and  a  curse.  The 
blessing  is  that  minimal  computer  skills  are  required.  The  curse  is 
that  minimal  computer  skills  are  required.  Right  answers  and  wrong 
answers  are  easy  to  obtain."  However,  if  researchers  remember  that 
"No  statistical  procedure  should  be  treated  as  a  mechanical  truth 
generator" (Meehl, 1992, p. 152, italics in the original), progress
toward  Ziman's  (1978)  goal  of  consensus  should  be  more  rapid  in  the 
future  than  it  has  been  in  the  past.  In  the  end,  investigators  who 
invest  the  time  to  familiarize  themselves  with  newer  techniques  will 
be  repaid  by  substantial  gains  in  their  ability  to  derive  more 
definite  answers  to  their  research  questions. 